How do I stem words in a Python list?

I have a Python list like the one shown below:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Now I need to stem each word in it and get another list containing the stemmed words. How can I do that?

from stemming.porter2 import stem
documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

What we're doing here is using a list comprehension to loop over each string in the main list, splitting it into a list of words. Then we loop over that word list, stemming each word as we go, and return the new list of stems.

Note that I haven't tried this with stemming installed; I took it from the comments and have never used it myself. Still, this is the basic concept for splitting the list into words. Note that it produces a list of word lists, preserving the original separation.
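For illustration, here is a minimal sketch of what the nested result looks like (assuming the stemming package is installed; the stems in the comment are indicative):

from stemming.porter2 import stem

docs = ["the intersection graph of paths in trees"]
nested = [[stem(word) for word in sentence.split(" ")] for sentence in docs]
print(nested)
# e.g. [['the', 'intersect', 'graph', 'of', 'path', 'in', 'tree']]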

If you don't want that separation, you can do:

documents = [stem(word) for sentence in documents for word in sentence.split(" ")]

This leaves you with one continuous list instead.
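With the same example input as the sketch above, the flattened version would look something like:

flat = [stem(word) for sentence in docs for word in sentence.split(" ")]
print(flat)
# e.g. ['the', 'intersect', 'graph', 'of', 'path', 'in', 'tree']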

If you want to glue the words back together at the end, you could do:

documents = [" ".join(sentence) for sentence in documents]

or do it all in one line:

documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]

which preserves the sentence structure, or

documents = " ".join(documents)

which ignores it.

You might want to look into NLTK (the Natural Language Toolkit). It has a module, nltk.stem, which contains various different stemmers.
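For example, a quick sketch of two of the stemmers nltk.stem provides (SnowballStemmer implements the Porter2 algorithm):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # the Porter2 algorithm

print(porter.stem("running"))  # 'run'
print(snowball.stem("trees"))  # 'tree'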

See also this question.

OK. So, using the stemming package, you'd end up with something like this:

from stemming.porter2 import stem
from itertools import chain

def flatten(list_of_lists):
    "Flatten one level of nesting"
    return list(chain.from_iterable(list_of_lists))

def stemall(documents):
    return flatten([[stem(word) for word in line.split(" ")] for line in documents])
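For example, calling it on a list of sentences returns a single flat list of stems:

print(stemall(["the intersection graph of paths in trees"]))
# e.g. ['the', 'intersect', 'graph', 'of', 'path', 'in', 'tree']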

You can use NLTK:

from nltk.stem import PorterStemmer

ps = PorterStemmer()
final = [[ps.stem(token) for token in sentence.split(" ")] for sentence in documents]

NLTK has many features for IR systems; check it out.
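For instance, you could swap str.split for NLTK's own tokenizer. This is a sketch reusing the documents list from the question; note that word_tokenize needs the 'punkt' tokenizer models downloaded first:

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# nltk.download('punkt')  # uncomment on first run
ps = PorterStemmer()
final = [[ps.stem(token) for token in word_tokenize(sentence)]
         for sentence in documents]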

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# `word_list` is your list of words; don't name it `list`,
# which would shadow the built-in type.
list_stem = [ps.stem(word) for word in word_list]

You can use whoosh (http://whoosh.readthedocs.io/):

from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh.support.charset import accent_map

# Chain an accent-folding filter onto the stemming analyzer:
my_analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)

tokens = my_analyzer("hello you, comment ça va ?")
words = [token.text for token in tokens]
print(' '.join(words))
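Applied to the documents list from the question, the same analyzer yields nested lists of stems. Note that StemmingAnalyzer's default pipeline includes a stop-word filter, so common words such as "of" and "the" will be dropped:

stemmed_docs = [[token.text for token in my_analyzer(doc)] for doc in documents]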

You can use either PorterStemmer or LancasterStemmer for stemming.
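For example, a quick comparison sketch using NLTK's implementations (Lancaster is the more aggressive of the two and typically produces shorter stems):

from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["maximum", "running", "trees"]:
    # Lancaster tends to cut more aggressively than Porter does.
    print(word, porter.stem(word), lancaster.stem(word))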