Counting tokens after tokenization, stop-word removal, and stemming



I have the following function:

def preprocessText (data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(string.punctuation))
        filtered = [word for word in tokens if word not in nltk.corpus.stopwords.words('english')]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    print(Counter(tokens).most_common(10))
    return (np.array(preprocessed))

It should remove punctuation, tokenize, remove stop words, and apply the Porter stemmer. However, it isn't working correctly. For example, when I run this code:

s = ["The cow and of.", "and of dog the."]
print (Counter(preprocessText(s)))

it produces this output:

[('and', 1), ('.', 1), ('dog', 1), ('the', 1), ('of', 1)]

It doesn't remove the punctuation or the stop words.

Your `translate` call isn't doing anything to remove the punctuation. Here is some working code. I made several changes, the most important of which is:

Code:

xlate = {ord(x): y for x, y in
         zip(string.punctuation, ' ' * len(string.punctuation))}
tokens = nltk.word_tokenize(each.lower().translate(xlate))
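For contrast, a quick sketch of why the original one-argument call was a no-op: passing `string.punctuation` directly makes `translate` index into that 32-character string by each character's ordinal, and for any ordinal at or beyond its length the lookup raises `IndexError` (a `LookupError`), which makes `translate` leave the character unchanged:

```python
import string

s = "the cow and of."
# string.punctuation has 32 characters, so it only has valid indices
# 0..31.  Every printable character has an ordinal >= 32, so the table
# lookup raises IndexError (a LookupError) and translate() leaves the
# character untouched -- the string comes back unmodified.
print(s.translate(string.punctuation) == s)
```

That is exactly why the punctuation survived in the original output.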

Test code:

from collections import Counter
import nltk
import string
import numpy as np
stopwords = set(nltk.corpus.stopwords.words('english'))
try:
    # python 2
    xlate = string.maketrans(
        string.punctuation, ' ' * len(string.punctuation))
except AttributeError:
    xlate = {ord(x): y for x, y in
             zip(string.punctuation, ' ' * len(string.punctuation))}
def preprocessText(data):
    stemmer = nltk.stem.porter.PorterStemmer()
    preprocessed = []
    for each in data:
        tokens = nltk.word_tokenize(each.lower().translate(xlate))
        filtered = [word for word in tokens if word not in stopwords]
        preprocessed.append([stemmer.stem(item) for item in filtered])
    return np.array(preprocessed)
s = ["The cow and of.", "and of dog the."]
print(Counter(sum([list(x) for x in preprocessText(s)], [])))

Result:

Counter({'dog': 1, 'cow': 1})

The problem is that you are misusing `translate`. To use it correctly, you need to build a mapping table which (as the help string will tell you) maps "Unicode ordinals to Unicode ordinals, strings, or None." For example, like this:

>>> mapping = dict((ord(x), None) for x in string.punctuation)  # `None` means "delete"
>>> print("This.and.that".translate(mapping))
Thisandthat
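The same deletion table can also be built with `str.maketrans` (this sketch assumes Python 3, where the three-argument form exists as a `str` static method); its third argument is a string of characters to delete:

```python
import string

# the third argument of str.maketrans lists characters mapped to None,
# i.e. characters that translate() will delete
table = str.maketrans('', '', string.punctuation)
print("This.and.that".translate(table))  # prints: Thisandthat
```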

But if you do this to the word tokens, you are just replacing the punctuation tokens with empty strings. You could add a step to get rid of them, but I suggest simply selecting what you actually want: namely, alphanumeric words.

tokens = [t for t in nltk.word_tokenize(each.lower()) if t.isalnum()]

That's the only change you need to make to your code.
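The effect of the filter can be seen on the tokens from the example, sketched here with a hardcoded token list standing in for `nltk.word_tokenize` output:

```python
# tokens as word_tokenize would produce them for "and of dog the."
tokens = ['and', 'of', 'dog', 'the', '.']
# isalnum() is False for pure-punctuation tokens, so '.' is dropped
print([t for t in tokens if t.isalnum()])
```

Stop-word filtering and stemming then run on a punctuation-free list, as intended.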
