Return the string pattern match plus the text before and after the pattern



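Suppose I have diary entries from 5 people, and I want to determine whether they mention any food-related keywords. I want the output to show each keyword with a window of one word before and after it, so that I have some context before deciding whether the match really is food-related.

The search should be case-insensitive, and it is fine if the keyword is embedded in another word. For example, if a keyword is "rice", I want the output to include "price".

Suppose I have the following data: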

foods <- c('corn', 'hot dog', 'ham', 'rice')
df <- data.frame(id = 1:5,
diary = c('I ate rice and corn today',
'Sue ate my corn.',
'He just hammed it up',
'Corny jokes are my fave',
'What is the price of milk'))

The output I am looking for is:

|ID|Output                          |
|--|--------------------------------|
|1 |"ate rice and", "and corn today"|
|2 |"my corn"                       |
|3 |"just hammed it"                |
|4 |"Corny jokes"                   |
|5 |"the price of"                  |

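I have tried stringi::stri_detect, but the output includes the entire diary entry.

I have also tried stringi::stri_extract, but I cannot find a way to include one word before and one word after the keyword.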

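The following solution also works when the same food appears more than once in the same phrase. It is based on splitting each phrase into individual words.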

library(tidyverse)

extract3 <- function(txt, word)
{
  str_split(txt, "\\W") %>%     # split the phrase on non-word characters
    unlist() %>%
    {. ->> w} %>%               # stash the words in `w` (superassignment) for use further down
    map(~ str_extract(.x, regex(paste0("(.)*", word, "(.)*"), ignore_case = TRUE))) %>%
    unlist() %>%
    is.na() %>%
    `!` %>%
    which() %>%                 # positions of the words that contain the keyword
    map_chr(~ paste(
      w[unique(c(max(c(.x - 1, 1)), .x, min(c(.x + 1, length(w)))))], collapse = " ")) %>%
    paste(collapse = ", ")
}

# try every keyword on every diary entry
df_out <- tibble()
for (i in 1:nrow(df))
  for (j in 1:length(foods))
    df_out <- rbind(df_out,
                    tibble(
                      id = df$id[i], diary = df$diary[i], output = extract3(df$diary[i], foods[j])))

# drop empty results and merge the matches of each entry into one string
df_out %>%
  filter(output != "") %>%
  group_by(id) %>%
  mutate(output = paste(output, collapse = ", ")) %>%
  ungroup() %>%
  distinct()

EDIT (without for loops)

library(tidyverse)

extract3 <- function(txt, word)
{
  str_split(txt, "\\W") %>%     # split the phrase on non-word characters
    unlist() %>%
    {. ->> w} %>%               # stash the words in `w` (superassignment) for use further down
    map(~ str_extract(.x, regex(paste0("(.)*", word, "(.)*"), ignore_case = TRUE))) %>%
    unlist() %>%
    is.na() %>%
    `!` %>%
    which() %>%                 # positions of the words that contain the keyword
    map_chr(~ paste(
      w[unique(c(max(c(.x - 1, 1)), .x, min(c(.x + 1, length(w)))))], collapse = " ")) %>%
    paste(collapse = ", ") %>%
    str_trim()                  # remove whitespace left behind by empty tokens
}

# same nested iteration as before, written with map_dfr instead of for loops
map_dfr(
  1:nrow(df),
  function(id) map_dfr(1:length(foods), ~ tibble(
    id = df$id[id],
    diary = df$diary[id],
    output = extract3(df$diary[id], foods[.])))) %>%
  filter(output != "") %>%
  group_by(id) %>%
  mutate(output = paste(output, collapse = ", ")) %>%
  ungroup() %>%
  distinct()
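The window logic inside extract3 can be looked at in isolation. The following is only a minimal sketch in base R; the helper name window_at is hypothetical and not part of the answer above. It shows how max() and min() clamp the window so that a keyword at the start or end of a phrase simply gets a shorter context.

```r
# Hypothetical helper (not part of the answer above): returns the word at
# position i together with its neighbours, clamping the window at both ends.
window_at <- function(w, i) {
  idx <- unique(c(max(i - 1, 1), i, min(i + 1, length(w))))
  paste(w[idx], collapse = " ")
}

w <- c("I", "ate", "rice", "and", "corn", "today")

window_at(w, 3)   # keyword in the middle: one word on each side
window_at(w, 1)   # keyword at the start: there is no left neighbour to add
window_at(w, 6)   # keyword at the end: there is no right neighbour to add
```

The unique() call is what prevents the keyword from being repeated when the clamped index coincides with the keyword's own position.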

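We can collapse the keywords into a single regular expression and extract the word ("\\w+") before and after the collapsed pattern. The regex() function accepts the argument ignore_case = TRUE, which takes care of case-insensitive matching. Depending on how inclusive the match should be, we may have to add word boundaries, with optional extra word characters, around the collapsed pattern, so that "rice", "price", "ham", and "hammed" are all covered. I made a small change to the data to make it more illustrative.

I am posting two answers. The first one excludes matches inside larger words, such as "hammed" or "price", so non-food matches return an empty result. The other one is more inclusive.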

Solution 1

library(dplyr)
library(stringr)

df %>% mutate(Output = str_extract_all(diary,
    regex(paste0("\\w+\\s+(",
                 paste("\\b", foods, "\\b", collapse = "|", sep = ''),
                 ")\\s+\\w+"),
          ignore_case = TRUE)))

Output 1

id                        diary             Output
1  1    I ate rice and corn today       ate rice and
2  2             Sue ate my corn.                  
3  3         He just hammed it up                  
4  4      Corny jokes are my fave                  
5  5    What is the price of milk                  
6  6 I like to eat ham sandwiches eat ham sandwiches

Solution 2

df %>% mutate(Output = str_extract_all(diary,
    regex(paste0("\\w+\\s+(",
                 paste("\\b\\w*", foods, "\\w*\\b", collapse = "|", sep = ''),
                 ")\\s+\\w+"),
          ignore_case = TRUE)))

Output 2

id                        diary             Output
1  1    I ate rice and corn today       ate rice and
2  2             Sue ate my corn.                  
3  3         He just hammed it up     just hammed it
4  4      Corny jokes are my fave                  
5  5    What is the price of milk       the price of
6  6 I like to eat ham sandwiches eat ham sandwiches

Data

foods <- c('corn', 'hot dog', 'ham', 'rice')
df <- data.frame(id = 1:6,
diary = c('I ate rice and corn today',
'Sue ate my corn.',
'He just hammed it up',
'Corny jokes are my fave',
'What is the price of milk',
'I like to eat ham sandwiches'))
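To see what separates the two solutions, it can help to build the keyword alternations on their own and test them against a single entry. This is only an illustrative sketch: the variable names strict, inclusive, and entry are made up here, and the surrounding-word parts of the full patterns are left out on purpose.

```r
library(stringr)

foods <- c('corn', 'hot dog', 'ham', 'rice')

# Solution 1 style: each keyword must be a whole word
strict <- paste("\\b", foods, "\\b", collapse = "|", sep = "")

# Solution 2 style: extra word characters are allowed around each keyword
inclusive <- paste("\\b\\w*", foods, "\\w*\\b", collapse = "|", sep = "")

cat(strict, "\n")
cat(inclusive, "\n")

entry <- "What is the price of milk"
strict_hit    <- str_detect(entry, regex(strict, ignore_case = TRUE))
inclusive_hit <- str_detect(entry, regex(inclusive, ignore_case = TRUE))

strict_hit      # "rice" is not a whole word here, so the strict pattern fails
inclusive_hit   # "price" contains "rice", so the inclusive pattern matches
```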

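Final edit

I fixed the " " issue and handled the case of multiple matches. We need a nested loop. The outer loop goes over all entries in "diary". The inner loop then goes over all "foods" and calls "str_extract_all" with the appropriate regular expression. The original regex required another word both before and after the food word, so foods at a sentence boundary were not matched. I put the ? quantifier (0 or 1 match) on the groups for the surrounding words, (\w+\s+) and (\s+\w+), so those cases now match as well. The only remaining issue is the order of the matches when an entry has several: they come out in the order of "foods" rather than in the order they appear in the entry, which still looks odd. But I think the solution is good enough now.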

library(purrr)

df %>% mutate(output = map(df$diary,
    ~ map(foods, \(x) str_extract_all(.x,
        regex(paste0("(\\w+\\s+)?(",
                     paste("\\b\\w*", x, "\\w*\\b", collapse = "|", sep = ''),
                     ")(\\s+\\w+)?"),
              ignore_case = TRUE)))) %>%
    map(unlist))
id                        diary                       output
1  1    I ate rice and corn today and corn today, ate rice and
2  2             Sue ate my corn.                      my corn
3  3         He just hammed it up               just hammed it
4  4      Corny jokes are my fave                  Corny jokes
5  5    What is the price of milk                 the price of
6  6 I like to eat ham sandwiches           eat ham sandwiches
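If the order of the matches matters, one possible approach is to record where each match starts and sort on that. The sketch below is an assumption about how this could be done, not part of the answer: the helper name extract_in_order is hypothetical, and it reuses the same regular expression together with stringr's str_locate_all.

```r
library(stringr)

foods <- c('corn', 'hot dog', 'ham', 'rice')

# Hypothetical helper: collects the matches for every keyword and returns
# them sorted by the position at which they start in the entry.
extract_in_order <- function(txt, foods) {
  starts <- integer(0)
  texts  <- character(0)
  for (x in foods) {
    pat <- regex(paste0("(\\w+\\s+)?(\\b\\w*", x, "\\w*\\b)(\\s+\\w+)?"),
                 ignore_case = TRUE)
    # str_locate_all and str_extract_all report the same matches in the same order
    starts <- c(starts, str_locate_all(txt, pat)[[1]][, "start"])
    texts  <- c(texts, str_extract_all(txt, pat)[[1]])
  }
  texts[order(starts)]
}

extract_in_order('I ate rice and corn today', foods)
extract_in_order('Sue ate my corn.', foods)
```

Each keyword is still searched separately, so overlapping windows such as "ate rice and" and "and corn today" are both kept; only the final ordering changes.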

I am not completely sure this is 100% useful, but it is worth a try:

First, define the keywords as a case-insensitive alternation pattern:
patt <- paste0("(?i)(", paste0(foods, collapse = "|"), ")")

Then extract the word on the left, the keyword itself (called node), and the word on the right, using stringr's str_extract_all function:

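library(stringr)

df1 <- data.frame(
  # run of non-space characters (or the start of the string) right before a keyword
  left = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), paste0("(?i)(\\S+|^)(?=\\s?", patt, ")"))),
  # the keyword itself
  node = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), patt)),
  # run of non-space characters (or the end of the string) right after a keyword
  right = unlist(str_extract_all(gsub("[.,!?]", "", df$diary), paste0("(?<=\\s?", patt, "\\s?)(\\S+|$)")))
)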

Result:

df1
left node right
1  ate rice   and
2  and corn today
3   my corn      
4 just  ham   med
5      Corn     y
6    p rice    of

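Although this is not the expected output, it may still serve your purpose if and only if that purpose is to check whether the matches really are keywords. For example, in rows 5 and 6 the view provided by df1 makes it immediately clear that these are not keyword matches.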
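If that check should be explicit rather than read off by eye, one option is to capture the characters glued to each keyword and flag the matches that stand alone. This is a sketch under assumptions: the names patt3, m, and check are made up for illustration, and the pattern is a variation on patt rather than the one used above.

```r
library(stringr)

foods <- c('corn', 'hot dog', 'ham', 'rice')
diary <- c('I ate rice and corn today',
           'Sue ate my corn.',
           'He just hammed it up',
           'Corny jokes are my fave',
           'What is the price of milk')

# groups: 1 = characters glued on before the keyword, 2 = keyword, 3 = glued on after
patt3 <- paste0("(?i)(\\w*)(", paste0(foods, collapse = "|"), ")(\\w*)")

# one row per match; column 1 is the full match, columns 2 to 4 are the groups
m <- do.call(rbind, str_match_all(diary, patt3))

check <- data.frame(
  match      = m[, 1],
  keyword    = m[, 3],
  whole_word = m[, 2] == "" & m[, 4] == ""   # TRUE only if nothing is glued on
)
check
```

Rows where whole_word is FALSE correspond to matches such as "hammed", "Corny", and "price", which can then be reviewed or filtered out.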

EDIT:

This solution preserves the id values:

library(tidyverse)
library(purrr)

extract_ <- function(df_row){
  data.frame(
    id = df_row$id,
    left = unlist(str_extract_all(gsub("[.,!?]", "", df_row$diary), paste0("(?i)(\\S+|^)(?=\\s?", patt, ")"))),
    node = unlist(str_extract_all(gsub("[.,!?]", "", df_row$diary), patt)),
    right = unlist(str_extract_all(gsub("[.,!?]", "", df_row$diary), paste0("(?<=\\s?", patt, "\\s?)(\\S+|$)")))
  )
}

df %>%
  group_split(id) %>%        # splits the data frame into a list of pieces, one per id
  map_dfr(~ extract_(.x))    # now we iterate over the pieces with our function
id left node right
1  1  ate rice   and
2  1  and corn today
3  2   my corn      
4  3 just  ham   med
5  4      Corn     y
6  5    p rice    of
