Data manipulation - Text corrupted by calling stemCompletion and PlainTextDocument in R



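Given a text corpus, I want to use the tm (text mining) package in R to perform stemming and stem completion in order to normalize terms. However, the stem completion step is problematic in the 0.6.x versions of the package. I am using R 3.3.1 and tm 0.6-2.

This question has been asked before, but I have not seen a complete answer that actually works. Here is the complete code that demonstrates the problem.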

 require(tm)
 txt <- c("Once we have a corpus we typically want to modify the documents in it",
          "e.g., stemming, stopword removal, et cetera.",
          "In tm, all this functionality is subsumed into the concept of a transformation.")
 myCorpus <- Corpus(VectorSource(txt))
 myCorpus <- tm_map(myCorpus, content_transformer(tolower))
 myCorpus <- tm_map(myCorpus, removePunctuation)
 myCorpusCopy <- myCorpus
 # *Removing common word endings* (e.g., "ing", "es") 
 myCorpus <- tm_map(myCorpus, stemDocument, language = "english")
 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
 myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))
 tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
 print(tdm)
 print(dimnames(tdm)$Terms)

This is the output:

<<TermDocumentMatrix (terms: 19, documents: 2)>>
Non-/sparse entries: 20/18
Sparsity           : 47%
Maximal term length: 9
Weighting          : term frequency (tf)
 [1] "all"       "cetera"    "concept"   "corpus"    "document" 
 [6] "function"  "have"      "into"      "modifi"    "onc"      
[11] "remov"     "stem"      "stopword"  "subsum"    "the"      
[16] "this"      "transform" "typic"     "want"     

Several of the terms have been stemmed: "modifi", "remov", "subsum", "typic" and "onc".

Next, I want to complete the stems.

myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

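At this stage the corpus no longer consists of TextDocument objects, and creating a TermDocumentMatrix fails with the error: inherits(doc, "TextDocument") is not TRUE. The documented next step is to apply the PlainTextDocument() function.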

myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)

This is the output:

<<TermDocumentMatrix (terms: 2, documents: 2)>>
Non-/sparse entries: 4/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
[1] "content" "meta"   

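Calling PlainTextDocument has corrupted the corpus.

I expected the stems to be completed: e.g. "modifi" => "modifier", "onc" => "once", etc.

Calling PlainTextDocument did not corrupt the corpus.

You may have noticed that when you ran the line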

myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

you received several warning messages:

Warning messages:
1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

These are worth paying attention to ;)
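The pattern quoted in the warning, grep(sprintf("^%s", w), dictionary, value = TRUE), suggests that each w is expected to be a single stem. A minimal base-R sketch of that kind of prefix lookup follows; prefix_matches and the small dictionary are hypothetical stand-ins for illustration, not tm's actual implementation.

```r
# Hypothetical stand-in for the prefix lookup named in the warning text.
# It returns every dictionary word that starts with the given stem.
prefix_matches <- function(stem, dictionary) {
  grep(sprintf("^%s", stem), dictionary, value = TRUE)
}

# Assumed toy dictionary of full words.
dictionary <- c("once", "modify", "documents", "removal", "typically")

# One stem per lookup: the pattern has length 1, so no warning is raised.
print(prefix_matches("onc", dictionary))

# "modify" does not start with "modifi", so nothing is found for this stem.
print(prefix_matches("modifi", dictionary))

# Several stems mean one lookup per stem.
stems <- c("onc", "remov", "typic")
print(lapply(stems, prefix_matches, dictionary = dictionary))
```

When such a lookup is handed something longer than a single word, only its first element is used as the pattern, which is exactly what the warnings report.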

Here is how to do stem completion with your data:

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
tdm      <- TermDocumentMatrix(myCorpus, control = list(stemming = TRUE)) 
cbind(stems = rownames(tdm), completed = stemCompletion(rownames(tdm), myCorpus))
          stems       completed       
all       "all"       "all"           
cetera    "cetera"    "cetera"        
concept   "concept"   "concept"       
corpus    "corpus"    "corpus"        
document  "document"  "documents"     
function  "function"  "functionality" 
have      "have"      "have"          
into      "into"      "into"          
modifi    "modifi"    ""              
onc       "onc"       "once"          
remov     "remov"     "removal"       
stem      "stem"      "stemming"      
stopword  "stopword"  "stopword"      
subsum    "subsum"    "subsumed"      
the       "the"       "the"           
this      "this"      "this"          
transform "transform" "transformation"
typic     "typic"     "typically"     
want      "want"      "want"

To write the changes back to the TDM permanently:

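# Complete every stem in x against dict and return the result as a single
# PlainTextDocument (note: this is not a TermDocumentMatrix).
stemCompletion_mod <- function(x, dict) {
  stems     <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}
tdm <- stemCompletion_mod(rownames(tdm), myCorpus)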

tdm$content

[1] "all cetera concept corpus documents functionality have into NA once removal stemming stopword subsumed the this transformation typically want"
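If the goal is a corpus whose documents contain the completed words, so that a TermDocumentMatrix can still be built from it, one possible approach is to complete each document as plain text and then rebuild the corpus. The sketch below assumes tm and SnowballC are installed. The helper complete_text is a hypothetical name, and falling back to the stem when no completion is found is my own choice rather than anything tm does.

```r
library(tm)

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")

# Lower-case and strip punctuation with base R; the cleaned words are the dictionary.
clean      <- gsub("[[:punct:]]+", "", tolower(txt))
dictionary <- unique(unlist(strsplit(clean, " +")))

# Hypothetical helper: stem and complete one document given as a single string.
complete_text <- function(text, dictionary) {
  words <- unlist(strsplit(text, " +"))
  words <- words[nzchar(words)]
  stems <- stemDocument(words)   # one word per element
  completed <- unname(stemCompletion(stems, dictionary = dictionary, type = "first"))
  # Assumption: keep the stem whenever no completion was found.
  unmatched <- is.na(completed) | !nzchar(completed)
  completed[unmatched] <- stems[unmatched]
  paste(completed, collapse = " ")
}

completed_txt <- vapply(clean, complete_text, character(1),
                        dictionary = dictionary, USE.NAMES = FALSE)

# Rebuild a corpus from the completed text, then build the TDM as usual.
myCorpus <- VCorpus(VectorSource(completed_txt))
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(Terms(tdm))
```

Because stemCompletion is called on a plain vector of single stems here, it is never handed a whole document, and the result remains a regular corpus.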

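Regarding Hack-R's solution, I ran into the same problem as Jason: I wanted to use the stem-completed words in a word cloud and have them be part of the TDM.

Since stemCompletion does not return a TDM, I extracted the terms from the TDM and then ran stemCompletion on them.

(I split them out into a separate variable while testing.)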

require(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
      "e.g., stemming, stopword removal, et cetera.",
      "In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus
 # *Removing common word endings* (e.g., "ing", "es") 
myCorpus <- tm_map(myCorpus, stemDocument, language = "english")
 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)

This gives the following output:

 [1] "all"       "cetera"    "concept"   "corpus"    "document" 
 [6] "function"  "have"      "into"      "modifi"    "onc"      
[11] "remov"     "stem"      "stopword"  "subsum"    "the"      
[16] "this"      "transform" "typic"     "want"     

Since stemCompletion appears to return a character vector, I simply replaced the terms part of tdm with the stem-completed version:

tdm$dimnames$Terms <- as.character(stemCompletion(tdm$dimnames$Terms, myCorpusCopy, type = "prevalent"))
print(tdm$dimnames$Terms)

This gave me:

 [1] "all"            "cetera"         "concept"        "corpus"        
 [5] "documents"      "functionality"  "have"           "into"          
 [9] ""               "once"           "removal"        "stemming"      
[13] "stopword"       "subsumed"       "the"            "this"          
[17] "transformation" "typically"      "want"          

Obviously, you get empty fields for words it does not know how to complete ("modifi"), but at least this way you can use the stem-completed version...
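One way to avoid the empty fields is to fall back to the original stem whenever the completion comes back empty or NA. The following is a minimal base-R sketch; fill_uncompleted is a hypothetical helper name, and the two short vectors only mimic the shape of the output above.

```r
# Hypothetical helper: keep the stem wherever the completion is "" or NA.
fill_uncompleted <- function(stems, completed) {
  completed <- as.character(completed)
  missing   <- is.na(completed) | completed == ""
  ifelse(missing, stems, completed)
}

# Toy vectors shaped like the terms and completions shown above.
stems     <- c("document", "modifi", "onc")
completed <- c("documents", "", "once")

print(fill_uncompleted(stems, completed))
```

With the TDM from above, the same helper could be applied to tdm$dimnames$Terms together with the stemCompletion result before assigning the terms back. It may also be worth checking the new terms with anyDuplicated, since two different stems could complete to the same word.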
