Suppose I have the following sentence:
s = c("I don't want to remove punctuation for negations. Instead, I want to remove only general punctuation. For example, keep I wouldn't like it but remove Inter's fan or Man city's fan.")
I would like to get the following result:
"I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan."
Now, if I simply use the code below, I remove not only the 's but also the ' in the negations.
s %>% str_replace_all("['']s\\b|[^[:alnum:][:blank:]@_]"," ")
"I don t want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn t like it but remove Inter fan or Man city fan "
To sum up, I need code that removes general punctuation, including "'s", but keeps the negations in their original form.
Can anyone help me?
Thanks!
You can use a negative lookahead (?!t) to test that the [:punct:] is not followed by a t.
gsub("[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
#[1] "I don't want to remove punctuation for negations Instead I want to remove only general punctuation For example keep I wouldn't like it but remove Inter fan or Man city fan"
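To see what the lookahead does token by token, here is a small sketch on a made-up test vector (the vector `x` is only an illustration and is not from the question). Note that any punctuation directly followed by `t` is kept, not just the apostrophe in `n't`.

```r
# Hypothetical test vector, chosen only to illustrate the lookahead
x <- c("don't", "Inter's", "stop,then")

# Remove a punctuation character (plus one following word character, if any)
# unless the punctuation is immediately followed by "t"
res <- gsub("[[:punct:]](?!t)\\w?", "", x, perl = TRUE)
print(res)
#[1] "don't"     "Inter"     "stop,then"
```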
If you want to be stricter, you can additionally test with (?<!n) that there is no n before it.
gsub("(?<!n)[[:punct:]](?!t)\\w?", "", s, perl=TRUE)
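One thing to be aware of: the two lookarounds in the pattern above are independent, so a punctuation character is removed only if it is neither preceded by `n` nor followed by `t`. A period after a word ending in `n` would therefore survive. A possible variant, offered as a sketch and not as part of the original answer, nests the lookbehind inside the lookahead so that only an apostrophe sitting between `n` and `t` is protected. The test string `x` is made up for illustration.

```r
# Hypothetical test string with periods after words ending in "n"
x <- "I don't want punctuation. Keep wouldn't, drop Inter's fan."

# Pattern from the answer: both conditions must hold for a removal
strict <- gsub("(?<!n)[[:punct:]](?!t)\\w?", "", x, perl = TRUE)

# Variant (assumption): skip a punctuation character only when it is
# an apostrophe preceded by "n" and followed by "t"
nested <- gsub("(?!(?<=n)'t)[[:punct:]]\\w?", "", x, perl = TRUE)

print(strict)
#[1] "I don't want punctuation. Keep wouldn't drop Inter fan."
print(nested)
#[1] "I don't want punctuation Keep wouldn't drop Inter fan"
```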
Or, if you want to restrict it to 't only (thanks to @chris-ruehlemann):
gsub("(?!'t)[[:punct:]]\\w?", "", s, perl=TRUE)
Or remove all punct except ', and also remove 's:
gsub("[^'[:^punct:]]|'s", "", s, perl = TRUE)
The same, but using a lookahead:
gsub("(?!')[[:punct:]]|'s", "", s, perl = TRUE)
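As a quick check that the negated class and the lookahead behave the same way, the sketch below runs both on a made-up string (`x` is an illustration, not from the question). It also shows a side effect worth knowing about: the `'s` of a contraction such as "It's" is removed just like a possessive.

```r
# Hypothetical test string mixing a contraction, a possessive and a negation
x <- "It's Tom's, isn't it?"

# Negated class: any punctuation except the apostrophe, or a literal 's
by_class <- gsub("[^'[:^punct:]]|'s", "", x, perl = TRUE)

# Lookahead: any punctuation that is not an apostrophe, or a literal 's
by_lookahead <- gsub("(?!')[[:punct:]]|'s", "", x, perl = TRUE)

print(by_class)
#[1] "It Tom isn't it"
print(identical(by_class, by_lookahead))
#[1] TRUE
```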
We can do this in two steps: first remove all punctuation except "'", then remove "'s" using a fixed match:
gsub("'s", "", gsub("[^[:alnum:][:space:]']", "", s), fixed = TRUE)
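The fixed match removes every literal "'s", including one at the start of a quoted word. If that matters, one option is to require a word boundary after the `s`. The sketch below wraps the two steps in a helper; the name `strip_punct` and the test string are hypothetical, and the word-boundary variant is an assumption about what is wanted, not part of the original answer.

```r
# Hypothetical helper: two-step removal with a word boundary after 's
strip_punct <- function(x) {
  # Step 1: drop everything that is not alphanumeric, whitespace or an apostrophe
  x <- gsub("[^[:alnum:][:space:]']", "", x)
  # Step 2: drop 's only when the s ends a word
  gsub("'s\\b", "", x, perl = TRUE)
}

out <- strip_punct("Inter's fan said: 'some' don't.")
print(out)
#[1] "Inter fan said 'some' don't"
```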