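I asked a similar question before, but this one is much more specific and needs a different solution from the one given there, so I hope it is OK to post it. I need to keep only apostrophes and intra-word dashes in my text (removing all other punctuation). For example, I want to get str2 from str1:
str1<-"I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2<-"I'm dash before word word dash in-between word two before word word just dashes between words word word"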
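My current solution is to first remove the dashes between words:
gsub(" - ", " ", str1)
and then keep only the alphanumeric characters plus the remaining dashes:
gsub("[^[:alnum:]['-]", " ", str1)
The problem is that it does not remove the remaining dashes, for example "-", or dashes at the beginning and end of words: "-word" or "word -"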
I think this would do it:
gsub('( |^)-+|-+( |$)', '\\1', gsub("[^ [:alnum:]'-]", '', str1))
#[1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"
Here is one way:
gsub("([[:alnum:]][[:punct:]][[:alnum:]])|[[:punct:]]", "\\1", str1)
# [1] "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"
Or, more explicitly:
gsub("([[:alnum:]]['-][[:alnum:]])|[[:punct:]]", "\\1", str1)
The same idea, slightly different and shorter:
gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", str1, perl=TRUE)
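A caveat that may be worth checking with the capture-group patterns above: the group consumes the character on each side of the dash or apostrophe, so a one-character segment between two dashes cannot also serve as the left neighbour of the second dash. The sketch below illustrates this with the perl=TRUE variant on a made-up input (the string y is my own example, not from the question).

```r
# Hypothetical input, not taken from the question.
y <- "rock-n-roll, isn't it?"

# "k-n" is matched and kept first; the search then resumes at the second "-",
# which no longer has an unconsumed word character before it, so it is
# treated as ordinary punctuation and dropped.
res <- gsub("(\\w['-]\\w)|[[:punct:]]", "\\1", y, perl = TRUE)
print(res)
# [1] "rock-nroll isn't it"
```

For the sample string in the question this does not matter, since every intra-word dash there sits between longer segments.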
I suggest
x <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
# => "I'm dash before word word dash  in-between word two before word word just dashes  between words word  word"
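See the R demo. The regex is
\b([-'])\b|[[:punct:]]+
See the regex demo. Details:
- \b([-'])\b - a "-" or "'" enclosed by word characters (letters, digits or _), captured in group 1 so that the replacement "\\1" puts it back (note: if you only want to keep them between letters, use (?<=\p{L})([-'])(?=\p{L}) instead)
- | - or
- [[:punct:]]+ - one or more punctuation characters.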
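To make the note about the letter-only variant concrete, here is a minimal sketch comparing the two patterns. The input z is a made-up example of mine, chosen so that it contains a digit range as well as a hyphenated word; the variable names are illustrative only.

```r
# Hypothetical input, not taken from the question.
z <- "it's 3-4 well-known"

# Word-boundary version: digits count as word characters, so "3-4" keeps its dash.
by_boundary <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", z, perl = TRUE)

# Letter-only version: the lookarounds require a Unicode letter on each side,
# so the dash between the digits falls through to [[:punct:]]+ and is removed.
by_letter <- gsub("(?<=\\p{L})([-'])(?=\\p{L})|[[:punct:]]+", "\\1", z, perl = TRUE)

print(by_boundary)  # [1] "it's 3-4 well-known"
print(by_letter)    # [1] "it's 34 well-known"
```

Which of the two is preferable depends on whether ranges such as 3-4 should count as intra-word dashes in your data.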
To remove any leading/trailing whitespace and the double whitespace characters left behind by the replacement, you can use
res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", x, perl=TRUE)
res <- trimws(gsub("\\s{2,}", " ", res))
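The two steps can be wrapped in a small helper and checked against the expected string from the question. This is only a sketch: the function name keep_intraword_punct is hypothetical, and perl = TRUE on the whitespace step is my own choice for consistency.

```r
# Hypothetical helper name; it wraps the two steps shown above.
keep_intraword_punct <- function(text) {
  # Step 1: keep "-" or "'" between word characters, drop all other punctuation.
  res <- gsub("\\b([-'])\\b|[[:punct:]]+", "\\1", text, perl = TRUE)
  # Step 2: collapse runs of whitespace to one space and trim both ends.
  trimws(gsub("\\s{2,}", " ", res, perl = TRUE))
}

str1 <- "I'm dash before word -word, dash &%$,. in-between word, two before word --word just dashes ------, between words word - word"
str2 <- "I'm dash before word word dash in-between word two before word word just dashes between words word word"

cleaned <- keep_intraword_punct(str1)
print(identical(cleaned, str2))
# [1] TRUE
```

Because gsub and trimws are vectorised, the same helper should also work on a character vector of several texts.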