PorterStemmer() treats the last word in a sentence differently



I have the following code, which runs in an offline environment:

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams': ['First value because one does two THREE', 'Second value because three and three four', 'Third donkey three']}
test = pd.DataFrame(test, columns=['grams'])
STOPWORDS = {'and', 'does', 'because'}

def rower(x):
    # Replace punctuation and whitespace characters with spaces, then lowercase
    cleanQ = []
    for i in range(len(x)):
        cleanQ.append(re.sub(r'[\b()\\"\'/\[\]\s+,.:?;]', ' ', x[i]).lower())

    # Tokenize, drop stopwords, and join each row back into one string
    splitQ = []
    for row in cleanQ:
        splitQ.append(row.split())
    splitQ[:] = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    splitQ = list(map(' '.join, splitQ))
    print(splitQ)

    # Stem each row
    originQ = []
    for i in splitQ:
        originQ.append(PorterStemmer().stem(i))
    print(originQ)

rower(test.grams)

This produces:

['first value one two three', 'second value three three four', 'third donkey three']
['first value one two thre', 'second value three three four', 'third donkey thre']

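The first list shows the sentences before PorterStemmer() is applied. The second list shows the sentences after PorterStemmer() is applied.

As you can see, PorterStemmer() trims the word three to thre only when it is the last word in the sentence. When three is not the last word, it stays three. I can't figure out why it does this. I'm also worried that if I apply rower(x) to other sentences, it may produce similar results without my noticing.

How can I prevent PorterStemmer from treating the last word differently?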

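The main mistake here is passing multiple words to the stemmer instead of one word at a time. The entire string ('third donkey three') is treated as a single word, so only its final part gets stemmed.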

import pandas as pd
import re
from nltk.stem import PorterStemmer

test = {'grams': ['First value because one does two THREE', 'Second value because three and three four',
                  'Third donkey three']}
test = pd.DataFrame(test, columns=['grams'])
STOPWORDS = {'and', 'does', 'because'}
ps = PorterStemmer()

def rower(x):
    # Replace punctuation and whitespace characters with spaces, then lowercase
    cleanQ = []
    for i in range(len(x)):
        cleanQ.append(re.sub(r'[\b()\\"\'/\[\]\s+,.:?;]', ' ', x[i]).lower())
    # Tokenize and drop stopwords, keeping each sentence as a list of words
    splitQ = []
    for row in cleanQ:
        splitQ.append(row.split())
    splitQ = [[word for word in sub if word not in STOPWORDS] for sub in splitQ]
    print('IN:', splitQ)
    # Stem one word at a time
    originQ = [[ps.stem(word) for word in sent] for sent in splitQ]
    print('OUT:', originQ)

rower(test.grams)

Output:

IN: [['first', 'value', 'one', 'two', 'three'], ['second', 'value', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
OUT: [['first', 'valu', 'one', 'two', 'three'], ['second', 'valu', 'three', 'three', 'four'], ['third', 'donkey', 'three']]
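
If you need one string per sentence again, as in the question's original output, the per-word stems can be joined back together. The following is only a minimal sketch of that idea; the helper name stem_sentence is made up for illustration and is not part of NLTK.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_sentence(sentence, stemmer=ps):
    """Stem each whitespace-separated word on its own, then rejoin.

    stem_sentence is a hypothetical helper, not an NLTK function.
    """
    return ' '.join(stemmer.stem(word) for word in sentence.split())

# Sentences as they look after cleaning and stopword removal
sentences = [
    'first value one two three',
    'second value three three four',
    'third donkey three',
]
stemmed = [stem_sentence(s) for s in sentences]
print(stemmed)
```

Because every word goes through the stemmer separately, the stems should match the OUT: list above, and a word's position in the sentence no longer matters.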

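The related question below has a good explanation of why the stemmer drops the final "e" from some words. If the output does not meet your expectations, consider using a lemmatizer instead.

How to stop NLTK stemmer from removing the trailing "e"?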
