Remove all punctuation except apostrophes and intra-word dashes



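I asked a similar question before, but this one is much more specific and needs a different solution from the one I was given, so I hope it is OK to post it. I need to keep only apostrophes and intra-word dashes in my text and delete all other punctuation. For example, I want to get str2 from str1: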
str1<-"I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2<-"I'm dash before word word dash in-between word two before word  word just dashes  between words word  word"

My current solution is to first remove the dashes between words:
gsub(" - ", " ", str1)

and then keep only alphanumeric characters plus the remaining dashes and apostrophes:
gsub("[^[:alnum:]['-]", " ", str1)

The problem is that it does not remove trailing dashes such as "-", or dashes at the beginning or end of a word: "-word" or "word -"

I think this would do it:

gsub('( |^)-+|-+( |$)', '\\1', gsub("[^ [:alnum:]'-]", '', str1))
#[1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"

Here is one approach:

gsub("([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]", "\\1", str1)
# [1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"

Or, more explicitly:

gsub("([[:alnum:]]['-][[:alnum:]])|[[:punct:]]", "\\1", str1)

The same idea, slightly different and shorter:

gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", str1, perl=TRUE)
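One caveat with the capture-group variants above, shown as a minimal sketch: the group consumes the characters on both sides of the apostrophe or dash, so a single-character segment between two such marks cannot take part in two matches and the second mark is dropped. The input string is made up for illustration. A boundary-based pattern such as \b([-'])\b consumes only the mark itself and does not show this behaviour.

```r
# Illustrative input (not from the question): two apostrophes around one letter
s <- "rock'n'roll"

# Consuming version: "k'n" is matched and kept; the second "'" then has no
# unconsumed word character before it, so it falls through to [[:punct:]]
consuming <- gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", s, perl = TRUE)

# Boundary version: \\b is zero-width, so both apostrophes survive
boundary <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", s, perl = TRUE)

print(consuming)  # "rock'nroll"
print(boundary)   # "rock'n'roll"
```

For the str1 in the question the two give the same result, since none of its words has this shape.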

I suggest

x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
# =>  "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"

See the R demo. The regex is

\b([-'])\b|[[:punct:]]+

See the regex demo. Details:

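  • \b([-'])\b - a - or ' enclosed by word characters (letters, digits or _), captured into Group 1 (note: to keep them only between letters, use (?<=\p{L})([-'])(?=\p{L}) instead)
  • | - or
  • [[:punct:]]+ - one or more punctuation characters.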
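A minimal sketch of how the two patterns in the note above differ, using a made-up input where a dash sits between a letter and a digit. It assumes the PCRE library behind perl=TRUE was built with Unicode property support, which \p{L} requires.

```r
s <- "mid-90s in-between"  # illustrative input, not from the question

# Word-boundary version: digits count as word characters, so "mid-90s" is kept
word_chars <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", s, perl = TRUE)

# Letter-only version: the dash before "9" is not followed by a letter,
# so it falls through to [[:punct:]]+ and is removed
letters_only <- gsub("(?<=\\p{L})([-'])(?=\\p{L})|[[:punct:]]+", "\\1", s, perl = TRUE)

print(word_chars)    # "mid-90s in-between"
print(letters_only)  # "mid90s in-between"
```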

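To remove any leading/trailing whitespace and runs of two or more whitespace characters left behind by the replacement, you can use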

res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
res <- trimws(gsub("\\s{2,}", " ", res))
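If this is needed in more than one place, the two steps can be wrapped in a small helper. This is only a sketch: the name clean_punct is hypothetical and not part of any package, and the inputs are made up. Since gsub and trimws are vectorised, it works on a character vector as well.

```r
# Hypothetical helper: keep intra-word - and ', drop other punctuation,
# then collapse repeated whitespace and trim the ends
clean_punct <- function(x) {
  res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl = TRUE)
  trimws(gsub("\\s{2,}", " ", res))
}

out <- clean_punct(c("I'm -word, in-between --word", " word - word ------, "))
print(out)
# [1] "I'm word in-between word" "word word"
```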
