R - Remove all punctuation from text, including apostrophes, with the tm package



I have a vector made up of tweets (just the message text) that I am cleaning for text mining. I used removePunctuation from the tm package, like so:

clean_tweet_text = removePunctuation(tweet_text)

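This gives a vector with all punctuation removed from the text except apostrophes, which breaks my keyword searches because words touching an apostrophe are not registered. For example, one of my keywords is climate, but if a tweet contains 'climate, it is not counted.

How can I remove all apostrophes/single quotes from the vector?

Here is the head of the vector from dput, as a reproducible example: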

c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap", 
"who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…", 
"rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…", 
"better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok", 
"why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl", 
"ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck", 
"unusual warming kills gulf of maine cod  discovery news globalwarming  httpstco39uvock3xe", 
"this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc", 
"what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
)

To remove all punctuation (including apostrophes and single quotes), just use gsub():

x <- c("expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap",
"who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https…",
"rt oddly enough some republicans think climate change is real oddly enough… httpstcomtlfx1mnuf uniteblue https…",
"better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
"i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok",
"why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl",
"ted cruz ‘climate change is not science it’s religion’ httpstco0qqtbofe0h via glennbeck",
"unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe",
"this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc",
"what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o")
gsub("[[:punct:]]", "", x)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"

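gsub() replaces every occurrence of its first argument (the pattern) in its third argument with its second argument (see help("gsub")). Here, that means every occurrence in our vector x of any character in the set [[:punct:]] is replaced with "" (that is, removed).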
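
As a minimal sketch of why this matters for keyword matching, the toy vector below (hypothetical input, not taken from the question's data) uses plain ASCII apostrophes, which [[:punct:]] covers in any locale. Splitting on spaces before and after cleaning shows the keyword only becoming visible once the punctuation is gone:

```r
# Hypothetical toy input: ASCII apostrophes touching the keyword
tweets <- c("'climate change isn't real'", "climate, weather & more!")

# Remove every character in the POSIX punctuation class
cleaned <- gsub("[[:punct:]]", "", tweets)

# Naive keyword check: split on single spaces and look for an exact word match
has_keyword <- function(text, keyword) {
  keyword %in% strsplit(text, " ", fixed = TRUE)[[1]]
}

before <- has_keyword(tweets[1], "climate")   # the token is "'climate", so no match
after  <- has_keyword(cleaned[1], "climate")  # the token is "climate"

print(cleaned)
print(c(before = before, after = after))
```

Note that removing a character such as & leaves the surrounding spaces in place, so the second element ends up with two consecutive spaces.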

Which characters does this remove? From help("regex"):

[:punct:]

    Punctuation characters:
    ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

Update

It seems this happens because your apostrophes look like ’ rather than like '. So, if you want to stick with tm::removePunctuation(), you can also use

tm::removePunctuation(x, ucp = TRUE)
#>  [1] "expert briefing on climatechange disarmament sdgs nmun httpstco5gqkngpkap"                                                
#>  [2] "who uses nasa earth science data he looks at impact of aerosols on climateamp weather httpstcof4azsiqkw1 https"           
#>  [3] "rt oddly enough some republicans think climate change is real oddly enough httpstcomtlfx1mnuf uniteblue https"            
#>  [4] "better dead than red bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"              
#>  [5] "i see red people bill gates says that only socialism can save us from climate change httpstcopypqmd1fok"                  
#>  [6] "why go for ecosystem basses conservation climatechange raajje maldives ecocaremv httpstcorauhjbasyl"                      
#>  [7] "ted cruz climate change is not science its religion httpstco0qqtbofe0h via glennbeck"                                     
#>  [8] "unusual warming kills gulf of maine cod discovery news globalwarming httpstco39uvock3xe"                                  
#>  [9] "this is an amusing headline bill gates says that only socialism can save us from climate change httpstcobfs5zbcijc"       
#> [10] "what do the remaining republican candidates have to say about climate change fixgov httpstcoxpszwbrcnh httpstcodgqyidkw6o"
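
If loading tm is not convenient, a similar result can be sketched in base R by matching the Unicode punctuation property \p{P} with perl = TRUE. This is offered as an assumed equivalent for illustration rather than a statement of what tm does internally. The toy vector y is hypothetical and is written with \u escapes so that the curly quotes (U+2018, U+2019) and the ellipsis (U+2026) are unambiguous:

```r
# Hypothetical toy input with curly quotes and an ellipsis, written as escapes
y <- c("ted cruz \u2018climate change is not science it\u2019s religion\u2019",
       "oddly enough\u2026 https\u2026")

# \p{P} matches any character in the Unicode "punctuation" categories;
# it is only available with the PCRE engine, hence perl = TRUE
no_punct <- gsub("\\p{P}", "", y, perl = TRUE)

print(no_punct)
```

One design caveat: \p{P} covers punctuation only. Characters such as $, + or ~ belong to the Unicode symbol categories (\p{S}), so a pattern like "[\\p{P}\\p{S}]" would be needed to strip those as well.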
