R - Remove a list of whole words that may contain special characters from a character vector, without matching parts of words



I have a list of words in R, as shown below:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")

I want to remove the words found in this list from a text such as the following:

myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

After removing the unwanted myList words, myText should look like this:

This is Sample Text, which is better and cleaned, where is not equal to. This is messy text.

I am using:

stringr::str_replace_all(myText,"[^a-zA-Z\\s]", " ")

But this does not help me. What should I do?

You can use a PCRE regex with the base R gsub function (the same pattern also works in str_replace_all, which uses the ICU regex engine):

\s*(?<!\w)(?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00)(?!\w)

See the regex demo.

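  • \s* - zero or more whitespace characters
  • (?<!\w) - a negative lookbehind that ensures there is no word character immediately before the current position
  • (?:at|ax|CL|OZ|Gm|Kg|C100|-1\.00) - a non-capturing group containing the escaped items from the character vector, i.e. the words to remove
  • (?!\w) - a negative lookahead that ensures there is no word character immediately after the current position.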
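As a rough illustration of what each piece contributes, the sketch below applies three variants of the pattern to a made-up sentence (the string in `x` is an assumption for demonstration, not from the question). It uses only base R `gsub`.

```r
# Made-up sample text: "at" appears both as a whole word and inside other words
x <- "that cat sat at home"

# Plain alternative, no guards: also eats "at" inside "that", "cat", "sat"
naive <- gsub("at", "", x)

# Lookarounds only: whole word removed, but the space in front of it stays
no_ws <- gsub("(?<!\\w)at(?!\\w)", "", x, perl = TRUE)

# Lookarounds plus \s*: whole word and its leading whitespace removed
full <- gsub("\\s*(?<!\\w)at(?!\\w)", "", x, perl = TRUE)

print(naive)  # "th c s  home"
print(no_ws)  # "that cat sat  home"
print(full)   # "that cat sat home"
```

Note that `perl = TRUE` is needed for the lookbehind, since the default regex engine in `gsub` does not support it.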

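Note: we cannot use the \b word boundary here, because the items in the myList character vector may start or end with non-word characters, and the meaning of \b depends on context.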
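A small sketch of that context dependence, using the "-1.00" item from the question on two short made-up strings: `\b` before `-` only matches when the preceding character is a word character, so the standalone number is missed, while the lookaround version removes it.

```r
x <- "is not equal to -1.00."

# \b before "-" needs a word character on the left; here it is a space, so no match
with_b <- gsub("\\b-1\\.00\\b", "", x, perl = TRUE)

# The lookarounds only require that no word character touches the match
with_look <- gsub("(?<!\\w)-1\\.00(?!\\w)", "", x, perl = TRUE)

# The same \b pattern does match when a letter precedes the hyphen
glued <- gsub("\\b-1\\.00\\b", "", "a-1.00", perl = TRUE)

print(with_b)     # "is not equal to -1.00."
print(with_look)  # "is not equal to ."
print(glued)      # "a"
```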

See the R demo online:

myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."
# Escape regex metacharacters in each word so they are matched literally
escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) }
# Build: optional leading whitespace + lookbehind + alternation + lookahead
pat <- paste0("\\s*(?<!\\w)(?:", paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|"), ")(?!\\w)")
cat(pat, "\n")
gsub(pat, "", myText, perl=TRUE)
## => [1] "This is Sample Text, which is better and cleaned, where is not equal to. This is messy text."

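  • escape_for_pcre <- function(s) { return(gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)) } - escapes all special characters that need escaping in a PCRE pattern
  • paste(sapply(myList, function(t) escape_for_pcre(t)), collapse = "|") - builds a |-separated list of alternatives from the vector of search terms.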
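To see what the escaping step buys, here is a minimal sketch that reuses the escape_for_pcre helper from the answer. The string "-1x00" is a made-up near-miss for illustration.

```r
# Helper as defined in the answer above
escape_for_pcre <- function(s) gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)

escaped <- escape_for_pcre("-1.00")
print(escaped)  # "-1\\.00", i.e. the regex -1\.00

# Unescaped, the dot matches any character, so a different token is hit as well
hit_raw     <- grepl("-1.00", "-1x00", perl = TRUE)   # TRUE
hit_escaped <- grepl(escaped, "-1x00", perl = TRUE)   # FALSE
hit_literal <- grepl(escaped, "-1.00", perl = TRUE)   # TRUE
```

Without the escaping, any item containing `.`, `+`, `(` and similar characters would either match more than intended or produce an invalid pattern.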
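
Another option is to join the words with | directly, without escaping or boundary checks:

gsub(paste0(myList, collapse = "|"), "", myText)

This gives (note the leftover spaces where words were removed):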

[1] "This is  Sample  Text, which  is  better and cleaned , where  is not equal to . This is messy text ."
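The first answer mentions that the lookaround pattern also works with str_replace_all. A sketch of that variant follows, assuming the stringr package is installed (the call is skipped otherwise). The pattern is built the same way as in the base R demo.

```r
myList <- c("at","ax","CL","OZ","Gm","Kg","C100","-1.00")
myText <- "This is at Sample ax Text, which CL is OZ better and cleaned Gm, where C100 is not equal to -1.00. This is messy text Kg."

# Same escaping helper and pattern construction as in the base R demo
escape_for_pcre <- function(s) gsub("([{[()|?$^*+.\\])", "\\\\\\1", s)
pat <- paste0("\\s*(?<!\\w)(?:", paste(escape_for_pcre(myList), collapse = "|"), ")(?!\\w)")

# Assumption: stringr is available; its regex engine is ICU rather than PCRE
if (requireNamespace("stringr", quietly = TRUE)) {
  res <- stringr::str_replace_all(myText, pat, "")
  print(res)
}
```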
