R regex with a positive lookbehind to match the following



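I have a data frame in R, and I want to match and keep the rows where

  • "woman" is the first word, or
  • the second word in the sentence, or
  • the third word in the sentence, provided it is preceded by "no", "not", or "never".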
phrases_with_woman <- structure(list(phrase = c("woman get degree", "woman obtain justice", 
"session woman vote for member", "woman have to end", "woman have no existence", 
"woman lose right", "woman be much", "woman mix at dance", "woman vote as member", 
"woman have power", "woman act only", "she be woman", "no committee woman passed vote")), row.names = c(NA, 
-13L), class = "data.frame")

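In the example above, I want to match every row except "she be woman".

Here is my code so far. I have a positive lookbehind, (?<=woman\s)\w+, which seems to be on the right track, but it matches too many of the preceding words. I tried using {1} to match a single preceding word, but that syntax did not work.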

library(dplyr)
library(stringr)
matches <- phrases_with_woman %>%
  filter(str_detect(phrase, "^woman|(?<=woman\\s)\\w+"))

Thanks for your help.

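Each condition can be one alternative in the pattern, although the last condition needs two alternatives, assuming no/not/never can be either the first or the second word.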

library(dplyr)
# Backslashes must be doubled inside an R string literal.
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"
phrases_with_woman %>%
  filter(grepl(pat, phrase))
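
To see why anchoring matters, here is a minimal base R sketch that runs the alternation pattern above and the lookbehind from the question on the same input. The vector is a small subset of the question's data plus one extra phrase ("they say woman vote") that I added purely for illustration; it is not in the original data.

```r
# Base R only. The last phrase is an illustrative addition, not from the question.
phrases <- c(
  "woman get degree",               # "woman" is the first word
  "session woman vote for member",  # "woman" is the second word
  "no committee woman passed vote", # third word, preceded by "no"
  "she be woman",                   # third word, no negation
  "they say woman vote"             # third word, no negation (added example)
)

# Anchored alternation: one branch per allowed position of "woman".
pat <- "^(woman|\\w+ woman|\\w+ (no|not|never) woman|(no|not|never) \\w+ woman)\\b"
anchored <- grepl(pat, phrases)

# Pattern from the question: the lookbehind is not tied to any word position,
# so it accepts any phrase in which some word follows "woman ".
lookbehind <- grepl("^woman|(?<=woman\\s)\\w+", phrases, perl = TRUE)

data.frame(phrase = phrases, anchored, lookbehind)
```

The two columns differ only on the added phrase: the lookbehind accepts it because a word follows "woman", whereas the anchored pattern rejects it because "woman" is the third word and has no negation in front of it.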

I haven't come up with a pure regex solution, but here is a workaround.

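library(dplyr)
library(stringr)
phrases_with_woman %>%
  # keep rows where "woman" is among the first two words, or is the third
  # word with no/not/never among the first two
  filter(str_detect(word(phrase, 1, 2), "\\bwoman\\b") |
           (word(phrase, 3) == "woman" &
              str_detect(word(phrase, 1, 2), "\\b(no|not|never)\\b")))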
#                            phrase
# 1                woman get degree
# 2            woman obtain justice
# 3   session woman vote for member
# 4               woman have to end
# 5         woman have no existence
# 6                woman lose right
# 7                   woman be much
# 8              woman mix at dance
# 9            woman vote as member
# 10               woman have power
# 11                 woman act only
# 12 no committee woman passed vote
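
The same word-position idea can be sketched in base R by splitting each phrase on whitespace. The helper name keep_phrase and its arguments are hypothetical, introduced here only for illustration. It looks at the first occurrence of the target word and assumes the phrases contain no punctuation.

```r
# Hypothetical helper (not from the answers): decide by word position.
keep_phrase <- function(phrase, target = "woman",
                        negations = c("no", "not", "never")) {
  words <- strsplit(phrase, "\\s+")[[1]]
  pos <- match(target, words)  # position of the first "woman", or NA if absent
  if (is.na(pos)) return(FALSE)
  if (pos <= 2) return(TRUE)   # first or second word: always keep
  # third word: keep only if one of the two preceding words is a negation
  pos == 3 && any(words[1:2] %in% negations)
}

phrases <- c("woman get degree", "session woman vote for member",
             "no committee woman passed vote", "she be woman")
kept <- phrases[vapply(phrases, keep_phrase, logical(1))]
kept
```

Splitting into words trades the compactness of a single pattern for conditions that are easier to read and extend, for example to a longer list of negations.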
