Remove backslashes from words in R

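I have been trying to do topic modeling on a set of articles. I cleaned the raw data, which contains a lot of backslashes and numbers. Even after removing punctuation, backslashes, and numbers, I still find backslashes and numbers among the top terms of topic 1. The code I use for preprocessing is: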

library(tm)

# Convert to lower case
articles <- tm_map(articles, content_transformer(tolower))
# Remove numbers
articles <- tm_map(articles, removeNumbers)
# Remove English common stopwords
articles <- tm_map(articles, removeWords, stopwords("english"))
# Remove punctuation
articles <- tm_map(articles, removePunctuation)
# Replace backslashes with spaces; one literal backslash is "\\\\" as a regex in an R string
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
articles <- tm_map(articles, toSpace, "\\\\")
# Eliminate extra white spaces (done last so the spaces added above are collapsed too)
articles <- tm_map(articles, stripWhitespace)

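Even after trying to clean the data, I still get backslashes and numbers among the top terms of the topic:

design robot
class
medical
device wkh\003
students
dcbl
ri\003
course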

Backslashes and numbers make no sense as topic terms. Please suggest a solution.

You can use the stringr package. For example:

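library(tidyverse)

# A backslash must be doubled in an R string literal, so "\\003" is a backslash followed by 003
df <- tibble(text = c("robot", "class", "medical", "device wkh\\003", "students", "dcbl", "ri\\003", "course", NA))

# "\\\\" is the regex for a single literal backslash
df %>%
  mutate(text = str_remove_all(text, "\\\\"))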

# A tibble: 9 × 1
text         
<chr>        
1 robot        
2 class        
3 medical      
4 device wkh003
5 students     
6 dcbl         
7 ri003        
8 course       
9 NA
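
One possible reason the backslashes survive the cleaning in the question: R prints control characters with an escape, so a term displayed as ri\003 may contain the single control character U+0003 rather than a backslash and three digits. In that case removing backslashes or numbers has no effect. The sketch below, in base R only, handles both cases. The sample vector and the helper name clean_terms are hypothetical and only illustrate the idea; they are not part of tm or stringr.

```r
# Hypothetical sample terms, modelled on the ones in the question.
terms <- c("device wkh\\003",  # a literal backslash followed by the digits 003
           "ri\003",           # one control character (U+0003), which R prints as \003
           "course")

# clean_terms() is an illustrative helper, not a function from tm or stringr.
clean_terms <- function(x) {
  x <- gsub("\\", " ", x, fixed = TRUE)  # literal backslashes; fixed = TRUE avoids regex escaping
  x <- gsub("[[:cntrl:]]", " ", x)       # control characters such as \003
  x <- gsub("[[:digit:]]+", " ", x)      # digits left over once the backslash is gone
  trimws(gsub("[[:space:]]+", " ", x))   # collapse repeated whitespace and trim the ends
}

clean_terms(terms)
# [1] "device wkh" "ri"         "course"
```

To use the same idea on a tm corpus, the helper could be wrapped with content_transformer() and passed to tm_map(), in the same way as toSpace in the question.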
