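I want to find the nth occurrence of a word in an utterance and put brackets around it. I have tried various things, and the closest I got was with gsub, but I cannot use {copy-1} as the repetition count inside my regex. Any ideas? Can a variable be put there? Thanks!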
library(tidyverse)  # provides str_count(), mutate(), uncount() and %>%
#creating my df
utterance <- c("we are not who we think we are", "they know who we are")
df <- data.frame(utterance)
df$occurences = str_count(df$utterance, "we")
df <- df %>% mutate(ID = row_number())
df <- df %>% uncount(occurences) %>% group_by(ID) %>% mutate(copy = row_number())
#this is my gsub
gsub("((?:we){copy-1}.*)we", "\[we\]", df$utterance)
This would be my desired result:
  utterance                           ID  copy
  <chr>                            <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1
How about this:
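library(tidyverse)
# Bracket the c-th match of `target` in `s`; return `s` unchanged if there is no such match
f <- function(s, c, target) {
  g <- gregexpr(target, s)[[1]][c]  # start of the c-th match (NA or -1 if absent)
  if (is.na(g) || g < 0) return(s)
  # nchar(), not length(): length(target) is 1 for any single string
  paste0(str_sub(s, 1, g - 1), "[", target, "]", str_sub(s, g + nchar(target)))
}
df %>% rowwise() %>% mutate(utterance = f(utterance, copy, "we"))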
Output:
  utterance                           ID  copy
  <chr>                            <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1
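To see why the function indexes with `[[1]][c]` and then checks for NA or a negative value, it may help to look at what gregexpr() returns. The following is a small illustrative sketch in base R; the variable names `s` and `hits` are made up for the example.

```r
s <- "we are not who we think we are"

# gregexpr() returns a list with one element per input string;
# that element holds the start position of every match
hits <- gregexpr("we", s)[[1]]
as.integer(hits)   # 1 16 25

hits[2]            # 16: start of the second "we"
hits[4]            # NA: there is no fourth match

# When nothing matches at all, the single value -1 is returned
as.integer(gregexpr("xyz", s)[[1]])   # -1
```

Indexing past the last match yields NA, and a string with no match at all yields -1, which is why both cases need a guard.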
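Note that this will also find targets that are not whole words. For example, the second occurrence of "we" in "we went where we went yesterday" is the first two letters of the word "went", not the second occurrence of the word "we". If you want to restrict matching to whole words, you can update the gregexpr() call to: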
g = gregexpr(paste0("\\b", target, "\\b"), s)[[1]][c]
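The difference between substring and whole-word matching can be checked directly. This is a minimal sketch on a made-up sentence; `perl = TRUE` is my addition for the illustration and is not part of the answer's call.

```r
s <- "we went where we went yesterday"

# Plain pattern: also matches the "we" inside "went"
as.integer(gregexpr("we", s, perl = TRUE)[[1]])        # 1 4 15 18

# Word boundaries: only the standalone word "we"
as.integer(gregexpr("\\bwe\\b", s, perl = TRUE)[[1]])  # 1 15
```

Note the doubled backslash: inside an R string literal, a single "\b" is a backspace character, not a regex word boundary.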
Here is a string-splitting approach. We can split the input string on we and then piece it back together, using [we] as the nth connector.
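repn <- function(x, find, repl, n) {
  # split on the whole word; note the doubled backslashes in "\\b"
  parts <- strsplit(x, paste0("\\b", find, "\\b"))[[1]]
  output <- paste0(
    paste0(parts[1:n], collapse=find),
    repl,
    # rejoin the remainder with `find` rather than a hard-coded "we"
    paste0(parts[(n+1):length(parts)], collapse=find)
  )
  return(output)
}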
x <- "we are not who we think we are"
repn(x, "we", "[we]", 1)
repn(x, "we", "[we]", 2)
repn(x, "we", "[we]", 3)
[1] "[we] are not who we think we are"
[1] "we are not who [we] think we are"
[1] "we are not who we think [we] are"
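One caveat with splitting: strsplit() drops a trailing empty piece, so if the target is the last word of the string, or n is larger than the number of matches, the `(n+1):length(parts)` index runs backwards and the result is wrong. A possible variant that sidesteps this is sketched below. It swaps splitting for base R's `regmatches<-`; the name `repn_safe` and the use of `perl = TRUE` are my own assumptions, not part of the answer.

```r
# Hypothetical variant of repn(): replace the n-th whole-word match in place
repn_safe <- function(x, find, repl, n) {
  m <- gregexpr(paste0("\\b", find, "\\b"), x, perl = TRUE)
  hits <- regmatches(x, m)[[1]]      # all matched substrings (empty if none)
  if (n > length(hits)) return(x)    # fewer than n matches: leave x unchanged
  hits[n] <- repl                    # swap only the n-th match
  regmatches(x, m) <- list(hits)     # write the matches back into x
  x
}

repn_safe("we are not who we think we are", "we", "[we]", 2)
repn_safe("they know who we", "we", "[we]", 1)   # target is the last word
repn_safe("they know who we", "we", "[we]", 2)   # no second match
```

Like the original, this treats `find` as a regular expression, so targets containing regex metacharacters would need escaping first.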
Here is a mixed approach using a few additional packages:
library(data.table)
library(tibble)
library(dplyr)
library(tidyr)
df %>%
  rowid_to_column() %>%
  separate_rows(utterance, sep = " ") %>%
  group_by(rowid) %>%
  mutate(wordcount = ifelse(utterance == "we", rleid(rowid), NA), # simpler: wordcount = ifelse(utterance == "we", 1, NA)
         wordcount = cumsum(!is.na(wordcount))) %>%
  mutate(utterance = ifelse(utterance == "we" & wordcount == copy, paste0("[", utterance, "]"), utterance)) %>%
  summarise(utterance = paste0(utterance, collapse = " ")) %>%
  bind_cols(., df[, 2:3])
# A tibble: 4 × 4
  rowid utterance                           ID  copy
  <int> <chr>                            <int> <int>
1     1 [we] are not who we think we are     1     1
2     2 we are not who [we] think we are     1     2
3     3 we are not who we think [we] are     1     3
4     4 they know who [we] are               2     1
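Coming back to the original question of whether a variable can go inside the pattern: a regex is just a string, so the repetition count can be pasted in before the pattern is handed to sub(). The following is a rough sketch of that idea, not a tested drop-in; the helper name `bracket_nth` is hypothetical, and it assumes the target contains no regex metacharacters.

```r
# Hypothetical helper: bracket the n-th whole-word occurrence of `target` in `x`
bracket_nth <- function(x, target, n) {
  # Skip n - 1 earlier occurrences lazily, capture everything before the n-th one
  pattern <- sprintf("^((?:.*?\\b%s\\b){%d}.*?)\\b%s\\b",
                     target, as.integer(n - 1), target)
  # \\1 puts the captured prefix back, followed by the bracketed target
  sub(pattern, sprintf("\\1[%s]", target), x, perl = TRUE)
}

x <- "we are not who we think we are"
bracket_nth(x, "we", 1)
bracket_nth(x, "we", 3)
bracket_nth("they know who we are", "we", 2)   # no second match: unchanged
```

Because sub() is not vectorised over the pattern, applying this to the data frame from the question would need something like `mapply(bracket_nth, df$utterance, "we", df$copy, USE.NAMES = FALSE)`.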