When I use the tm package's stemDocument on the Reuters dataset from kaggle.com, I see some unwanted results:
| original word | after tm's stemDocument |
|---|---|
| inflation | inflat |
| united | unit |
Build the corpus:

```{r}
library(tm)

# use forward slashes (backslashes would need escaping in an R string)
trade.train.directory <- "F:/Reuters_Dataset/training/trade"
trade.trnCorpus <- VCorpus(DirSource(directory = trade.train.directory, encoding = "ASCII"))
```
First, list the original words whose stemming you want to override:
```{r}
unStemed.words <- c("anniversary", "united", "february", "many", "inflation", "initially")
#stemed.map <- setNames(as.list(unStemed.words), stemed.words)
```

Then list the replacement you want for each word, in the same order:

```{r}
stemed.words <- c("anniversary", "united", "february", "many", "inflate", "initial")
```
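To see why these words need protecting, you can check what the Snowball stemmer (which tm's stemDocument uses via the SnowballC package) produces for them:

```{r}
library(SnowballC)
wordStem(c("anniversary", "united", "february", "many", "inflation", "initially"),
         language = "english")
# "united" becomes "unit" and "inflation" becomes "inflat", for example
```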
Now create a function that tags each original word with a marker string, so that the word is not changed when tm's stemDocument is applied:

```{r}
EXCLUSION_MARK <- "_EXCLUUUUUU"
markStemExclusion <- content_transformer(function(x) {
  for (i in seq_along(unStemed.words)) {
    # "\\b" is a word boundary; the trailing "_" shields the token's
    # ending from the stemmer's suffix rules
    x <- gsub(paste0("\\b", unStemed.words[i], "\\b"),
              paste0(EXCLUSION_MARK, stemed.words[i], "_"),
              x)
  }
  return(x)
})
```
Once the words are marked we can safely apply stemDocument; afterwards this transformer strips the marker from the excluded words again:

```{r}
unMarkStemExclusion <- content_transformer(function(x) {
  # fixed = TRUE treats the marker as a literal string, not a regex
  x <- gsub(EXCLUSION_MARK, " ", x, fixed = TRUE)
  return(x)
})
```
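The effect of the mark/unmark pair can be checked on plain tokens (a sketch: wordStem from SnowballC is what stemDocument calls under the hood, and "_EXCLUUUUUU" is the marker defined above):

```{r}
library(SnowballC)
tokens  <- c(paste0("_EXCLUUUUUU", "inflate", "_"), "rose", "initially")
stemmed <- wordStem(tokens, language = "english")   # the marked token passes through unchanged
gsub("_EXCLUUUUUU", " ", stemmed, fixed = TRUE)
```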
Now call the remaining data-cleaning steps one by one, and you get the desired result:
```{r}
# replaceAbbreviations, cleanHtmlTags and replacePunctBySpace are
# user-defined transformers defined elsewhere in my script
cleanData <- function(corpus, excludeStopWords = FALSE) {
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, replaceAbbreviations)
  # mark the words that must not be stemmed
  corpus <- tm_map(corpus, markStemExclusion)
  if (excludeStopWords == FALSE) {
    corpus <- tm_map(corpus, removeWords,
                     c("said", "will", "next", stopwords("english")))
  }
  corpus <- tm_map(corpus, cleanHtmlTags)
  # stem everything that is not marked
  corpus <- tm_map(corpus, stemDocument)
  # remove the marker again (the trailing "_" is dropped by replacePunctBySpace)
  corpus <- tm_map(corpus, unMarkStemExclusion)
  corpus <- tm_map(corpus, replacePunctBySpace)
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}
```
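To run the whole pipeline on the corpus built above (a sketch: the three helpers below are only hypothetical stand-ins, since their real definitions are not shown in this post):

```{r}
# hypothetical stand-ins for the helpers referenced in cleanData
replaceAbbreviations <- content_transformer(function(x) x)
cleanHtmlTags        <- content_transformer(function(x) gsub("<[^>]+>", " ", x))
replacePunctBySpace  <- content_transformer(function(x) gsub("[[:punct:]]", " ", x))

trade.trnCorpus.clean <- cleanData(trade.trnCorpus)
writeLines(as.character(trade.trnCorpus.clean[[1]]))  # inspect the first cleaned document
```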