I have a set of documents:
documents = c("She had toast for breakfast",
"The coffee this morning was excellent",
"For lunch let's all have pancakes",
"Later in the day, there will be more talks",
"The talks on the first day were great",
"The second day should have good presentations too")
I want to remove stopwords from this set of documents. I have already converted to lower case and removed punctuation, using:
documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation
First, I convert to a corpus object:
documents <- Corpus(VectorSource(documents))
Then I try to remove the stopwords:
documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords
But that last line results in the following error:
THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.
This question has been asked here before, but no answer was given. What does this error mean?
EDIT
Yes, I am using the tm package.
Here is the output of sessionInfo():
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
When I run into problems with tm, I often end up just editing the original text instead. For removing words that is a bit awkward, but you can paste together a regular expression from tm's stopword list.
stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')
> documents
[1] " toast breakfast"                " coffee morning excellent"
[3] " lunch lets pancakes"            "later day will talks"
[5] " talks first day great"          " second day good presentations "
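The replacements leave stray spaces behind, as the output shows. If that matters, stringr's str_squish trims the ends and collapses internal runs of whitespace; a minimal follow-up sketch:
documents = stringr::str_squish(documents)  # trim ends and collapse repeated whitespace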
Maybe try using the tm_map functions to transform the documents instead. It seems to work in my case:
> documents = c("She had toast for breakfast",
+ "The coffee this morning was excellent",
+ "For lunch let's all have pancakes",
+ "Later in the day, there will be more talks",
+ "The talks on the first day were great",
+ "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 6
This produces:
> documents[[1]]$content
[1] " toast breakfast"
> documents[[2]]$content
[1] " coffee morning excellent"
> documents[[3]]$content
[1] " lunch lets pancakes"
> documents[[4]]$content
[1] "later day will talks"
> documents[[5]]$content
[1] " talks first day great"
> documents[[6]]$content
[1] " second day good presentations "
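removeWords leaves leading and doubled spaces behind, as the content above shows. tm ships a stripWhitespace transformation for exactly this; a one-line follow-up on the same corpus (note it collapses runs of whitespace to a single blank, so a leading blank may remain):
documents = tm_map(documents, stripWhitespace)  # collapse multiple whitespace characters to single blanks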
The quanteda package can also remove stopwords, but first make sure your words are tokens, then use the following:
library(quanteda)
x <- tokens(documents)                                    # tokenize first
x <- tokens_select(x, stopwords(), selection = 'remove')
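As a side note, quanteda also provides tokens_remove(), a shorthand for the same selection = 'remove' call; a minimal sketch on the example documents, assuming they are still a character vector:
library(quanteda)
toks <- tokens(documents, remove_punct = TRUE)  # tokenize, dropping punctuation tokens
toks <- tokens_remove(toks, stopwords('en'))    # equivalent to tokens_select(..., selection = 'remove')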
rflashtext could be an option:
library(tm)
library(rflashtext)
library(microbenchmark)
library(stringr)
documents <- c("She had toast for breakfast",
"The coffee this morning was excellent",
"For lunch let's all have pancakes",
"Later in the day, there will be more talks",
"The talks on the first day were great",
"The second day should have good presentations too") |> tolower()
stop_words <- stopwords("en")
processor <- KeywordProcessor$new(keys = stop_words, words = rep.int(" ", length(stop_words)))

Output:

processor$replace_keys(documents)
[1] " toast breakfast"                " coffee morning excellent"       " lunch pancakes"
[4] "later day, will talks"           " talks first day great"          " second day good presentations "
# rflashtext
microbenchmark(rflashtext = {
processor <- KeywordProcessor$new(keys = stop_words, words = rep.int(" ", length(stop_words)))
processor$replace_keys(documents)
})
Unit: microseconds
expr min lq mean median uq max neval
rflashtext 264.529 268.8515 280.9786 272.8165 282.0745 512.499 100
# stringr
microbenchmark(stringr = {
stopwords_regex <- sprintf("\\b%s\\b", paste(stop_words, collapse = "\\b|\\b"))
str_replace_all(documents, stopwords_regex, " ")
})
Unit: microseconds
expr min lq mean median uq max neval
stringr 646.454 650.7635 665.9317 658.328 670.7445 937.575 100
# tm
microbenchmark(tm = {
corpus <- Corpus(VectorSource(documents))
tm_map(corpus, removeWords, stop_words)
})
Unit: microseconds
expr min lq mean median uq max neval
tm 233.451 239.012 253.3898 247.086 262.143 442.706 100
There were 50 or more warnings (use warnings() to see the first 50)
Note: for simplicity, I am not considering punctuation removal here.
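If punctuation should go as well, the gsub pass from the question can simply run first; a minimal sketch:
documents <- gsub('[[:punct:]]', '', documents)  # remove punctuation before replacing stopwords
processor$replace_keys(documents)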