Replace the Nth occurrence of a word (substring) in a string in R, where N is the value of an integer column



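I want to find the nth occurrence of a word in an utterance and put brackets around it. I have tried various things, and the closest I have come is with gsub, but I cannot use {copy-1} as the repetition count in my regular expression. Any ideas? Is it possible to put a variable there? Thanks!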

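library(stringr)  # str_count()
library(dplyr)    # mutate(), group_by(), row_number(), %>%
library(tidyr)    # uncount()

# creating my df: one row per occurrence of "we", numbered within each utterance
utterance <- c("we are not who we think we are", "they know who we are")
df <- data.frame(utterance)
df$occurences = str_count(df$utterance, "we")
df <- df %>% mutate(ID = row_number())
df <- df %>% uncount(occurences) %>% group_by(ID) %>% mutate(copy = row_number())

# this is my gsub attempt (does not work: "copy" inside the pattern is literal text, not the column value)
gsub("((?:we){copy-1}.*)we", "\\[we\\]", df$utterance)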

This would be my desired result:

utterance                         ID  copy
<chr>                          <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1

How about this:

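library(tidyverse)

# Bracket the c-th match of `target` in the string s
f <- function(s, c, target) {
  g <- gregexpr(target, s)[[1]][c]     # start position of the c-th match
  if (is.na(g) || g < 0) return(s)     # fewer than c matches: return s unchanged
  # nchar(target), not length(target): length() of a single string is always 1
  paste0(str_sub(s, 1, g - 1), "[", target, "]", str_sub(s, g + nchar(target)))
}
df %>% rowwise() %>% mutate(utterance = f(utterance, copy, "we"))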

Output:

utterance                           ID  copy
<chr>                            <int> <int>
1 [we] are not who we think we are     1     1
2 we are not who [we] think we are     1     2
3 we are not who we think [we] are     1     3
4 they know who [we] are               2     1

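Note that this also finds targets that are not whole words. For example, in a sentence such as "we went where we went yesterday", the second match of "we" is the first two letters of the word "went", not the second occurrence of the word "we". To restrict matching to whole words, change the gregexpr() call to: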

g <- gregexpr(paste0("\\b", target, "\\b"), s)[[1]][c]
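
The effect of the word boundaries can be checked directly on a small example. The following is a minimal sketch; the sentence `s` and the variable names are made up for illustration, and `perl = TRUE` is a choice made here rather than something the answer requires.

```r
# Hypothetical example sentence: "went" starts with the letters "we"
s <- "we went where we went yesterday"

# Plain substring search: also matches the "we" at the start of "went"
all_hits <- as.integer(gregexpr("we", s)[[1]])

# Whole-word search: \\b anchors the match at word boundaries
word_hits <- as.integer(gregexpr("\\bwe\\b", s, perl = TRUE)[[1]])

print(all_hits)   # 1 4 15 18
print(word_hits)  # 1 15
```

With the plain pattern, the second match starts at position 4, inside "went"; with the boundaries, the second match is the standalone "we" at position 15.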

Here is a string-splitting approach. We can split the input string on "we" and then paste the pieces back together, using [we] as the nth connector.

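repn <- function(x, find, repl, n) {
  # Split on whole-word matches of `find` (perl = TRUE so \\b is handled by PCRE)
  parts <- strsplit(x, paste0("\\b", find, "\\b"), perl = TRUE)[[1]]
  # Fewer than n occurrences: return x unchanged. strsplit() drops a trailing
  # empty piece, so a match at the very end of x is not counted here.
  if (n >= length(parts)) return(x)
  paste0(
    paste0(parts[1:n], collapse = find),                    # pieces before the nth match
    repl,                                                   # the nth connector
    paste0(parts[(n + 1):length(parts)], collapse = find)   # pieces after it
  )
}
x <- "we are not who we think we are"
repn(x, "we", "[we]", 1)
repn(x, "we", "[we]", 2)
repn(x, "we", "[we]", 3)
[1] "[we] are not who we think we are"
[1] "we are not who [we] think we are"
[1] "we are not who we think [we] are"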
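
The calls above handle one string and one n at a time, whereas the question has a column of utterances and a column of counts. One possible way to apply the splitting function row by row is `mapply()`. This is only a sketch: `repn` is repeated so the block runs on its own, the early return for too few occurrences is an addition made here, and the vectors stand in for the `utterance` and `copy` columns of the question's data frame.

```r
# Splitting approach, repeated here so the block is self-contained
repn <- function(x, find, repl, n) {
  parts <- strsplit(x, paste0("\\b", find, "\\b"), perl = TRUE)[[1]]
  if (n >= length(parts)) return(x)  # assumption: leave x unchanged if there are fewer than n matches
  paste0(
    paste0(parts[1:n], collapse = find),
    repl,
    paste0(parts[(n + 1):length(parts)], collapse = find)
  )
}

# Stand-ins for the utterance and copy columns from the question
utterance <- c(rep("we are not who we think we are", 3), "they know who we are")
copy <- c(1L, 2L, 3L, 1L)

# One call per row: x and n vary, find and repl stay fixed
result <- mapply(repn, x = utterance, n = copy,
                 MoreArgs = list(find = "we", repl = "[we]"),
                 USE.NAMES = FALSE)
print(result)
```

The same call could be placed inside `mutate()` to overwrite the column in a data frame.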

Here is a hybrid approach that uses a few additional packages:

library(data.table)  # rleid()
library(tibble)      # rowid_to_column()
library(dplyr)
library(tidyr)       # separate_rows()

df %>%
  rowid_to_column() %>%
  separate_rows(utterance, sep = " ") %>%   # one row per word
  group_by(rowid) %>%
  mutate(wordcount = ifelse(utterance == "we", rleid(rowid), NA), # simpler: wordcount = ifelse(utterance == "we", 1, NA)
         wordcount = cumsum(!is.na(wordcount))) %>%               # running count of "we" within each row
  mutate(utterance = ifelse(utterance == "we" & wordcount == copy, paste0("[", utterance, "]"), utterance)) %>%
  summarise(utterance = paste0(utterance, collapse = " ")) %>%    # glue the words back together
  bind_cols(., df[, 2:3])                                         # restore ID and copy
# A tibble: 4 × 4
rowid utterance                           ID  copy
<int> <chr>                            <int> <int>
1     1 [we] are not who we think we are     1     1
2     2 we are not who [we] think we are     1     2
3     3 we are not who we think [we] are     1     3
4     4 they know who [we] are               2     1
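
On the original question of whether a variable can go inside the regular expression: a literal `{copy-1}` cannot, but the pattern string can be built at run time, for example with `sprintf()`. The following is a minimal sketch under some assumptions: `bracket_nth` is a hypothetical helper name, `word` is assumed to contain no regex metacharacters, and the vectors stand in for the `utterance` and `copy` columns.

```r
# Hypothetical helper: bracket the n-th whole-word occurrence of `word` in x
bracket_nth <- function(x, word, n) {
  # Group 1 lazily consumes everything up to the n-th occurrence,
  # including the first n - 1 occurrences; the final \\bword\\b is the n-th one
  pattern <- sprintf("^((?:.*?\\b%s\\b){%d}.*?)\\b%s\\b",
                     word, as.integer(n) - 1L, word)
  # Keep group 1 and replace only the n-th occurrence; no match leaves x unchanged
  sub(pattern, paste0("\\1[", word, "]"), x, perl = TRUE)
}

utterance <- c(rep("we are not who we think we are", 3), "they know who we are")
copy <- c(1L, 2L, 3L, 1L)

# The pattern differs per row, so call the helper once per (utterance, copy) pair
bracketed <- mapply(bracket_nth, x = utterance, n = copy,
                    MoreArgs = list(word = "we"), USE.NAMES = FALSE)
print(bracketed)
```

Because `sub()` accepts only a single pattern, the per-row loop via `mapply()` is what lets the repetition count vary with `copy`.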
