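Although I know there are tools like NLTK that can do this for me, I'd like to understand how to efficiently strip multiple stems from the words in a list.
Say my list of words is: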
list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
The common stems I want to remove might be:
stems = ["s", "es", "ed", "est", "ing", "ly"]  # etc.
The words I don't want stemmed are specified as:
noStem = ["walrus", "rest", "wing", "feed"]
I've worked out how to do this for one specific stem, such as "s". For example, my code would be:
stemmedList = []
for eachWord in list:
    if eachWord not in noStem:
        if eachWord[-1] == "s":
            eachWord = eachWord[:-1]
    stemmedList = stemmedList + [eachWord]
I'm not sure how to apply this to all of my stems in a more efficient way.
Thanks for your help and advice!
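I suggest converting noStem to a set, so that the membership test (eachWord not in noStem) is fast. Then you can use str.endswith to check whether the word ends with any of the stems in stems. If it does, take the longest matching stem and remove it from the word: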
lst = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
noStem = {"walrus", "rest", "wing", "feed"}
stemmedList = []
for word in lst:
    if word in noStem or not any([word.endswith(stem) for stem in stems]):
        stemmedList.append(word)
    else:
        stem = max([s for s in stems if word.endswith(s)], key=len)
        stemmedList.append(word[:len(word) - len(stem)])
print(stemmedList)
# ['another', 'cat', 'walrus', 'relax', 'annoying', 'rest', 'normal', 'hopp', 'class', 'wing', 'feed']
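As a possible variation on the loop above, the two scans over stems (one for any, one for max) can be folded into one by giving max a default. This is only a sketch; strip_longest_stem and no_stem are hypothetical names, not part of the original code.

```python
def strip_longest_stem(word, stems, no_stem):
    """Hypothetical helper: remove the longest stem the word ends with."""
    if word in no_stem:
        return word
    # default="" covers the case where no stem matches, so nothing is cut off
    stem = max((s for s in stems if word.endswith(s)), key=len, default="")
    return word[:len(word) - len(stem)]


words = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
         "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
no_stem = {"walrus", "rest", "wing", "feed"}

stemmed = [strip_longest_stem(w, stems, no_stem) for w in words]
print(stemmed)
# ['another', 'cat', 'walrus', 'relax', 'annoying', 'rest', 'normal', 'hopp', 'class', 'wing', 'feed']
```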
Stemming is more complicated than this, but here is some starter code using the pandas module, which should be faster. Here it goes.
import pandas as pd
import re
word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
# a set for quick lookup
noStem = set(["walrus", "rest", "wing", "feed"])
# build series
words = pd.Series(word_list)
# filter out words in noStem
words = words[words.apply(lambda x: x not in noStem)]
# compile the regular expression once (for performance), joining all stems into one alternation
term_matching = '|'.join(stems)
expr = re.compile(r'(.+?)({})$'.format(term_matching))
# split each word into (word, stem); words with no matching stem become NaN
df = words.str.extract(expr, expand=True)
# drop the words that had no matching stem
df.dropna(how='any', inplace=True)
df.columns = ['words', 'stems']
stemmed_list = df.words.tolist()
I hope it helps...
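If you want the same regular-expression idea but would like to keep the excluded words and the words with no matching stem in the result, one way to sketch it is with the standard re module alone. The names suffix_re and no_stem are made up for this illustration. Because the pattern is anchored at the end of the word and the search takes the leftmost match, the longest matching stem should be the one removed.

```python
import re

word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
             "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
no_stem = {"walrus", "rest", "wing", "feed"}

# Escape each stem and anchor the alternation at the end of the word.
suffix_re = re.compile("(?:{})$".format("|".join(re.escape(s) for s in stems)))

# Excluded words pass through unchanged; other words lose a trailing stem, if any.
stemmed_list = [w if w in no_stem else suffix_re.sub("", w, count=1)
                for w in word_list]
print(stemmed_list)
# ['another', 'cat', 'walrus', 'relax', 'annoying', 'rest', 'normal', 'hopp', 'class', 'wing', 'feed']
```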
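I think this is a decent start. You just need to add a second loop to handle multiple endings. You could try something like the following (you'll notice I've renamed the variable list to word_list, since shadowing a built-in name is dangerous):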
stemmed_list = []
for word in word_list:
    if word not in noStem:
        # The first matching ending wins, so list longer endings first (e.g. "es" before "s")
        for ending in stems:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break  # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)
Or, if, per your comment, you would rather not use endswith:
stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word[-len(ending):] == ending:
                word = word[:-len(ending)]
                break  # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)
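One thing to watch with either loop is that the break stops at the first ending that matches, so the order of stems affects the result. The sketch below illustrates this with a hypothetical helper, strip_first_match, and shows one possible workaround: sorting the endings longest-first before looping.

```python
def strip_first_match(word, endings):
    """Hypothetical helper: remove the first ending in `endings` that matches."""
    for ending in endings:
        if word.endswith(ending):
            return word[:-len(ending)]
    return word


short_first = ["s", "es", "ed", "est", "ing", "ly"]
# Sort by length, longest first, so "es" is tried before "s".
long_first = sorted(short_first, key=len, reverse=True)

print(strip_first_match("classes", short_first))  # classe
print(strip_first_match("classes", long_first))   # class
```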