Removing stopwords from a user-defined corpus in R



I have a set of documents:

documents = c("She had toast for breakfast",
 "The coffee this morning was excellent", 
 "For lunch let's all have pancakes", 
 "Later in the day, there will be more talks", 
 "The talks on the first day were great", 
 "The second day should have good presentations too")

Within this set of documents, I would like to remove the stopwords. I have already removed punctuation and converted to lower case, using:

documents = tolower(documents) #make it lower case
documents = gsub('[[:punct:]]', '', documents) #remove punctuation

First, I convert to a corpus object:

documents <- Corpus(VectorSource(documents))

Then I try to remove the stopwords:

documents = tm_map(documents, removeWords, stopwords('english')) #remove stopwords

But that last line results in the following error:

THE_PROCESS_HAS_FORKED_AND_YOU_CANNOT_USE_THIS_COREFOUNDATION_FUNCTIONALITY___YOU_MUST_EXEC() to debug.

This has been asked before here, but no answer was given. What does this error mean?

EDIT

Yes, I am using the tm package.

Here is the output of sessionInfo():

R version 3.0.2 (2013-09-25) Platform: x86_64-apple-darwin10.8.0 (64-bit)
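For context: this error typically means CoreFoundation was called from a forked child process on macOS. tm of that era (0.5.x, matching R 3.0.2) parallelized tm_map via parallel::mclapply, so a commonly suggested workaround — an assumption here, not confirmed in this thread — is to force single-core execution:

```r
library(tm)

documents <- Corpus(VectorSource(c("She had toast for breakfast")))
# mc.cores = 1 keeps tm_map from forking (tm 0.5.x-era argument;
# newer tm versions changed how parallelism is configured)
documents <- tm_map(documents, removeWords, stopwords("english"), mc.cores = 1)
```

If you are on a newer tm, this argument may no longer be accepted, in which case the transformations shown in the answers below avoid the issue.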

When I run into tm problems, I often end up just editing the original text.

For removing words it's a little awkward, but you can paste together a regex from tm's stopword list.

stopwords_regex = paste(stopwords('en'), collapse = '\\b|\\b')
stopwords_regex = paste0('\\b', stopwords_regex, '\\b')
documents = stringr::str_replace_all(documents, stopwords_regex, '')
> documents
[1] "     toast  breakfast"             " coffee  morning  excellent"      
[3] " lunch lets   pancakes"            "later   day  will   talks"        
[5] " talks   first day  great"         " second day   good presentations "
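As the output shows, the replacement leaves gaps where the removed words used to be. If that matters, a follow-up pass with stringr::str_squish() (same package as above) trims the ends and collapses internal runs of whitespace:

```r
library(stringr)

# Strings as they look after stopword removal
documents <- c("     toast  breakfast", " coffee  morning  excellent")

# str_squish() trims leading/trailing whitespace and collapses internal runs
documents <- str_squish(documents)
# → "toast breakfast" "coffee morning excellent"
```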

Maybe try using tm_map functions to transform the documents. It seems to work in my case.

> documents = c("She had toast for breakfast",
+  "The coffee this morning was excellent", 
+  "For lunch let's all have pancakes", 
+  "Later in the day, there will be more talks", 
+  "The talks on the first day were great", 
+  "The second day should have good presentations too")
> library(tm)
Loading required package: NLP
> documents <- Corpus(VectorSource(documents))
> documents = tm_map(documents, content_transformer(tolower))
> documents = tm_map(documents, removePunctuation)
> documents = tm_map(documents, removeWords, stopwords("english"))
> documents
<<VCorpus>>
Metadata:  corpus specific: 0, document level (indexed): 0
Content:  documents: 6

This yields:

> documents[[1]]$content
[1] "  toast  breakfast"
> documents[[2]]$content
[1] " coffee  morning  excellent"
> documents[[3]]$content
[1] " lunch lets   pancakes"
> documents[[4]]$content
[1] "later   day  will   talks"
> documents[[5]]$content
[1] " talks   first day  great"
> documents[[6]]$content
[1] " second day   good presentations "
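The remaining content still carries extra spaces where the stopwords were. tm ships a stripWhitespace transformation that collapses runs of whitespace to a single space (note it does not trim leading/trailing spaces), so a natural final step in the same pipeline is:

```r
library(tm)

documents <- Corpus(VectorSource(c("  toast  breakfast")))
# stripWhitespace collapses runs of whitespace into a single space
documents <- tm_map(documents, stripWhitespace)
```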
You can remove stopwords with the quanteda package, but first make sure your words are tokens, then use the following:

library(quanteda)
x <- tokens_select(x, stopwords(), selection = "remove")

rflashtext might be an option:

library(tm)
library(rflashtext)
library(microbenchmark)
library(stringr)
documents <- c("She had toast for breakfast",
              "The coffee this morning was excellent", 
              "For lunch let's all have pancakes", 
              "Later in the day, there will be more talks", 
              "The talks on the first day were great", 
              "The second day should have good presentations too") |> tolower()
stop_words <- stopwords("en")
processor <- KeywordProcessor$new(keys = stop_words, words = rep.int(" ", length(stop_words)))

Output:

processor$replace_keys(documents)
[1] "    toast   breakfast"                 "  coffee   morning   excellent"        "  lunch       pancakes"               
[4] "later     day,   will     talks"       "  talks     first day   great"         "  second day     good presentations  "
# rflashtext
microbenchmark(rflashtext = {
  processor <- KeywordProcessor$new(keys = stop_words, words = rep.int(" ", length(stop_words)))
  processor$replace_keys(documents)
})
Unit: microseconds
       expr     min       lq     mean   median       uq     max neval
 rflashtext 264.529 268.8515 280.9786 272.8165 282.0745 512.499   100
# stringr
microbenchmark(stringr = {
  stopwords_regex <- sprintf("\\b%s\\b", paste(stop_words, collapse = "\\b|\\b"))
  str_replace_all(documents, stopwords_regex, " ")
})
Unit: microseconds
    expr     min       lq     mean  median       uq     max neval
 stringr 646.454 650.7635 665.9317 658.328 670.7445 937.575   100
# tm 
microbenchmark(tm = {
  corpus <- Corpus(VectorSource(documents))
  tm_map(corpus, removeWords, stop_words)
})
Unit: microseconds
 expr     min      lq     mean  median      uq     max neval
   tm 233.451 239.012 253.3898 247.086 262.143 442.706   100
There were 50 or more warnings (use warnings() to see the first 50)

Note: for simplicity, I am not considering punctuation removal here.
