r - How to see the original words that map to a particular stem



I'm doing some text analysis in R with tm_map. I run the following code (without errors) to produce a document-term matrix of (stemmed and otherwise preprocessed) words.

corpus = Corpus(VectorSource(textVector))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument) 
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument, language="english")
dtm = DocumentTermMatrix(corpus)
mostFreqTerms = findFreqTerms(dtm, lowfreq=125) 

However, when I look through my (stemmed) mostFreqTerms, I see some terms that make me wonder, "Hmm, which words produced that stem?" There may also be stems that look meaningful at first glance, where I'm overlooking the fact that they actually lump together words with different meanings.

I'd like to apply the strategy/technique described in this SO answer for preserving particular terms during stemming (e.g., preventing "natural" and "naturalized" from becoming the same stemmed term): Text mining with the tm package - word stemming

But to do that more comprehensively, I'd like to see a list of all the individual words that correspond to each of my most frequent stems. Is there a way to find the words that, once stemmed, produce my mostFreqTerms list?
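One programmatic approach is to build a reverse index while stemming: apply the stemmer to each original word and record which words collapse onto each stem. A minimal Python sketch (the `toy_stem` function here is a crude stand-in for tm's Snowball stemmer, used only so the example is self-contained; real stems will differ):

```python
from collections import defaultdict

def toy_stem(word):
    # crude suffix stripping, for illustration only -- swap in any
    # real stemmer with the same word -> stem signature
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def stem_index(words):
    # map each stem to the set of original words that produced it
    index = defaultdict(set)
    for w in words:
        index[toy_stem(w)].add(w)
    return index

words = ["words", "word", "adding", "add", "toaster"]
for stem, originals in sorted(stem_index(words).items()):
    print(stem, "<-", sorted(originals))
```

Looking up a frequent stem in this index then lists every original word behind it, which is exactly the check needed before deciding whether a stem conflates distinct meanings.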

Edit: reproducible example

textVector = c("Trisha Takinawa: Here comes Mayor Adam West 
himself. Mr. West do you have any words 
for our viewers?Mayor Adam West: Box toaster
aluminum maple syrup... no I take that one 
back. Im gonna hold onto that one. 
Now MaxPower is adding adamant
so this example works")
corpus = Corpus(VectorSource(textVector))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, PlainTextDocument) 
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeWords, c(stopwords("english")))
corpus = tm_map(corpus, stemDocument, language="english")
dtm = DocumentTermMatrix(corpus)
mostFreqTerms = findFreqTerms(dtm, lowfreq=2) 
mostFreqTerms

The mostFreqTerms call above outputs:

[1] "adam" "one"  "west"

I'm looking for a programmatic way to determine that the stem "adam" came from the original words "adam" and "adamant".

Here you can see that the stemmed word "west" came from the words "west", "west", and "wester".

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import RSLPStemmer
import string

st = RSLPStemmer()
punctuations = list(string.punctuation)
textVector = ("Trisha Takinawa: Here comes Mayor adams West himself. Mr. "
              "West do you have any words for our viewers?Mayor Adam Wester: "
              "Box toaster aluminum maple syrup... no I take that one back. "
              "Im gonna hold onto that one. Now MaxPower is adding adamant "
              "so this example works")
tokens = word_tokenize(textVector.lower())
tokens = [w for w in tokens if w not in punctuations]
filtered_words = [w for w in tokens if w not in stopwords.words('english')]
steammed_words = [st.stem(w) for w in filtered_words]
allWordDist = nltk.FreqDist(w for w in steammed_words)
for w in allWordDist.most_common(2):
    for i in range(len(steammed_words)):
        if steammed_words[i] == w[0]:
            print(str(w[0]) + "=" + filtered_words[i])

west=west

west=west

west=wester

ad=adams

ad=adam
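Since filtered_words and steammed_words in the answer above are parallel lists, the stem-to-originals mapping can also be built in a single pass with a dictionary instead of the nested loop. A sketch, seeded with the sample values from the output shown above:

```python
from collections import defaultdict

# parallel lists as produced by the answer above (sample values
# taken from its printed output)
filtered_words = ["west", "west", "wester", "adams", "adam"]
steammed_words = ["west", "west", "west", "ad", "ad"]

stem_to_originals = defaultdict(list)
for stem, original in zip(steammed_words, filtered_words):
    stem_to_originals[stem].append(original)

for stem, originals in stem_to_originals.items():
    print(stem, "=", originals)
```

This avoids rescanning the whole stemmed list once per frequent term, and the resulting dict answers the original question directly: index it by any stem to get the words that produced it.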
