How to write a list of optional words, one of which is mandatory, in a regex (R)

I've run into a regex problem. I have three sentences:

s1 <- "today john jack and joe go to the beach"
s2 <- "today joe and john go to the beach"
s3 <- "today jack and joe go to the beach"

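I want to know whether john is going to the beach today, regardless of the other two guys. So the results for the three sentences should be (in order):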

TRUE
TRUE
FALSE

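I tried to do this with grepl in R. My first attempt returns TRUE for all three sentences, because john is only one of the alternatives and is never required: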

print(grepl("today (john|jack|joe|and| )+go to the beach", s1))
print(grepl("today (john|jack|joe|and| )+go to the beach", s2))
print(grepl("today (john|jack|joe|and| )+go to the beach", s3))

It helps when I sandwich "john" (the mandatory word) between two identical quantified groups of optional words:

print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s1))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s2))
print(grepl("today (jack|joe|and| )*john(jack|joe|and| )*go to the beach", s3))

However, this is obviously bad coding (repetition). Does anyone have a more elegant solution?

You can use .* wherever you don't know what may appear:

s <- c("today john jack and joe go to the beach", "today joe and john go to the beach", "today jack and joe go to the beach")
grepl("today .*\\bjohn\\b.* go to the beach", s)
## => [1]  TRUE  TRUE FALSE

See the online R demo.

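The \b word boundaries are used to match john as a whole word. In an R string literal the backslash must be doubled, hence "\\b".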
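To illustrate what the word boundaries buy you, here is a minimal sketch. The test strings are hypothetical and not taken from the question. Without \\b, the pattern also matches when john is only part of a longer word.

```r
# Hypothetical test strings, not from the original question
x <- c("today john goes", "today johnny goes", "john", "upjohn")

# Plain substring match: every string contains the letters "john"
plain <- grepl("john", x)
plain
# [1] TRUE TRUE TRUE TRUE

# Whole-word match: "johnny" and "upjohn" are rejected
whole <- grepl("\\bjohn\\b", x)
whole
# [1]  TRUE FALSE  TRUE FALSE
```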

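Edit: If you have a predefined whitelist of words that may appear between today and go, you cannot simply match anything with .*. You need an alternation group that lists all the allowed alternatives and, if you really want to shorten the pattern, a subroutine call in a PCRE regex: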

> grepl("today ((?:jack|joe|and| )*)john(?1)\\bgo to the beach", s, perl=TRUE)
[1]  TRUE  TRUE FALSE

See the regex demo.

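Here, the alternatives are wrapped in a quantified non-capturing group, and that whole group is wrapped in a "technical" capturing group whose pattern can be re-run with the (?1) subroutine call (the 1 refers to capturing group #1).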
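The following sketch may help show two things: how a subroutine call differs from a backreference, and how the whitelist pattern is stricter than the .* pattern. The sentence stored in extra and the short "jack joe" string are hypothetical examples, not part of the question.

```r
s <- c("today john jack and joe go to the beach",
       "today joe and john go to the beach",
       "today jack and joe go to the beach")

# (?1) re-runs the pattern of group 1, so it may match different text.
# A backreference would only re-match the exact text group 1 captured.
sub_call <- grepl("^(jack|joe) (?1)$", "jack joe", perl = TRUE)  # TRUE
back_ref <- grepl("^(jack|joe) \\1$", "jack joe", perl = TRUE)   # FALSE

# Hypothetical sentence with a word ("mary") that is not on the whitelist
extra <- "today john and mary go to the beach"

whitelist <- "today ((?:jack|joe|and| )*)john(?1)\\bgo to the beach"
wl_res <- grepl(whitelist, c(s, extra), perl = TRUE)
wl_res
# [1]  TRUE  TRUE FALSE FALSE

# The .* version accepts any words around john, including "mary"
any_res <- grepl("today .*\\bjohn\\b.* go to the beach", c(s, extra))
any_res
# [1]  TRUE  TRUE FALSE  TRUE
```

Note that subroutine calls are a PCRE feature, so perl = TRUE is required for the patterns that use (?1).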

Do you need to validate the rest of the sentence? Because otherwise I'd keep it simple:

sentences = c(s1, s2, s3)
grepl('\\bjohn\\b', sentences)
# [1]  TRUE  TRUE FALSE

This performs less validation, but expresses the stated intent more clearly: "Does John appear in the sentence?"
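As a rough illustration of that trade-off, the sketch below compares the simple pattern with the stricter .* pattern from the earlier answer. The first sentence is hypothetical: it mentions john but has nothing to do with the beach.

```r
# Hypothetical sentences; the first one is not about going to the beach
extra <- c("john stays home today", "today joe and john go to the beach")

# Simple check: is the whole word "john" present anywhere?
simple <- grepl("\\bjohn\\b", extra)
simple
# [1] TRUE TRUE

# Stricter check: john must appear between "today" and "go to the beach"
strict <- grepl("today .*\\bjohn\\b.* go to the beach", extra)
strict
# [1] FALSE  TRUE
```

If every input is already known to have the "today ... go to the beach" shape, the simple check is likely enough.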
