How to use slicing to remove several different stems from the ends of words



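Although I know there are tools like NLTK that can do this for me, I would like to understand how to efficiently strip several different stems from the words in a list.

Say my list of words is: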

list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]

The common stems I would like to remove might be:

stems = ["s", "es", "ed", "est", "ing", "ly"]  # etc.

with the words I do not want stemmed specified as:

noStem = ["walrus", "rest", "wing", "feed"]

I have figured out how to do this for one particular stem, such as "s". For example, my code is:

stemmedList = []
for eachWord in list:
    if eachWord not in noStem:
        if eachWord[-1] == "s":
            eachWord = eachWord[:-1]
    stemmedList = stemmedList + [eachWord]

I am not sure how to apply this to all of my stems in a more efficient way.

Thanks for your help and advice!

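I suggest converting noStem to a set so that the check if eachWord not in noStem is fast. Then you can check whether the word ends with any of the stems in stems (using endswith). If it does, take the longest matching stem and remove it from the word: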

lst = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
noStem = {"walrus", "rest", "wing", "feed"}
stemmedList = []
for word in lst:
    if word in noStem or not any([word.endswith(stem) for stem in stems]):
        stemmedList.append(word)
    else:
        stem = max([s for s in stems if word.endswith(s)], key=len)
        stemmedList.append(word[:len(word) - len(stem)])
print(stemmedList)
# ['another', 'cat', 'walrus', 'relax', 'annoying', 'rest', 'normal', 'hopp', 'class', 'wing', 'feed']
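
As a small variation on the loop above, str.endswith also accepts a tuple of suffixes, so the any(...) check can be written as a single call. The following is only a minimal sketch: the helper name strip_longest_stem and the snake_case variable names are illustrative and do not come from the answer, and it assumes every entry in stems is a non-empty string.

```python
def strip_longest_stem(word, stems, no_stem):
    """Return word without its longest matching stem, or unchanged if protected."""
    # str.endswith accepts a tuple and returns True if any suffix matches
    if word in no_stem or not word.endswith(tuple(stems)):
        return word
    # Several stems may match (e.g. "s" and "es"); keep the longest one
    longest = max((s for s in stems if word.endswith(s)), key=len)
    return word[:-len(longest)]


words = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
         "normal", "hopping", "classes", "wing", "feed"]
stems = ["s", "es", "ed", "est", "ing", "ly"]
no_stem = {"walrus", "rest", "wing", "feed"}

stemmed = [strip_longest_stem(w, stems, no_stem) for w in words]
print(stemmed)
```

Wrapping the logic in a function keeps the loop body out of the way and makes it easy to try different stem lists.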

Stemming is a lot more complicated than this, but here is some starter code that uses the faster pandas module. Here it goes.

import pandas as pd
import re
word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest", "normal", "hopping", "classes", "wing", "feed"]
stems = ["es",  "ed", "est", "ing", "ly", "s"]
# a set for quick lookup 
noStem = set(["walrus", "rest", "wing", "feed"])
# build series
words = pd.Series(word_list)
# filter out words in noStem
words = words[words.apply(lambda x: x not in noStem)]
# compile the regular expression once for performance - join all stems into one alternation
term_matching = '|'.join(stems)
expr = re.compile(r'(.+?)({})$'.format(term_matching))
df = words.str.extract(expr, expand=True)
df.dropna(how='any', inplace=True)
df.columns = ['words', 'stems']
# only words that matched a stem remain; noStem words and unmatched words were dropped above
stemmed_list = df.words.tolist()

I hope it helps...
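
The pandas snippet above returns only the words that actually matched a stem. If every word should stay in its original position, roughly the same regular expression can be applied with the standard re module alone. This is a hedged sketch rather than part of the answer: strip_suffix, suffix_re and no_stem are made-up names, and re.escape is added on the assumption that a stem might one day contain a regex metacharacter.

```python
import re

word_list = ["another", "cats", "walrus", "relaxed", "annoyingly", "rest",
             "normal", "hopping", "classes", "wing", "feed"]
stems = ["es", "ed", "est", "ing", "ly", "s"]
no_stem = {"walrus", "rest", "wing", "feed"}

# Lazy prefix followed by one of the stems; used with fullmatch below,
# so the stem must sit at the very end of the word.
suffix_re = re.compile(r"(.+?)(?:{})".format("|".join(re.escape(s) for s in stems)))


def strip_suffix(word):
    """Return the word without a trailing stem; protected words pass through."""
    if word in no_stem:
        return word
    match = suffix_re.fullmatch(word)
    # The lazy prefix is the shortest possible, so the longest stem is removed
    return match.group(1) if match else word


stemmed_list = [strip_suffix(w) for w in word_list]
print(stemmed_list)
```

Because the prefix group is lazy and the whole word must match, the shortest prefix wins, which is the same as removing the longest matching stem regardless of the order of stems.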

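I think you are off to a good start. You just need to add a second loop to handle multiple endings. You could try something like the following (you will notice I have renamed the variable list to word_list, because it is dangerous for a variable to shadow a built-in name):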

stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word.endswith(ending):
                word = word[:-len(ending)]
                break   # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)

Or, if as per your comment you do not want to use endswith:

stemmed_list = []
for word in word_list:
    if word not in noStem:
        for ending in stems:
            if word[-len(ending):] == ending:
                word = word[:-len(ending)]
                break   # This will prevent iterating over all endings once match is found
    stemmed_list.append(word)
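
One caveat with stopping at the first match: the result depends on the order of stems. With the order given in the question, "s" is tried before "es", so "classes" becomes "classe". A sketch of one way around this, assuming the longest matching ending should win, is to sort the endings longest-first before looping. The helper name strip_first_match is hypothetical and is only here for illustration.

```python
def strip_first_match(word, endings):
    """Remove the first ending in `endings` that the word ends with."""
    for ending in endings:
        # Slice comparison, as in the loop above, instead of str.endswith
        if word[-len(ending):] == ending:
            return word[:-len(ending)]
    return word


stems = ["s", "es", "ed", "est", "ing", "ly"]
# Try longer endings first so "es" is checked before "s"
longest_first = sorted(stems, key=len, reverse=True)

print(strip_first_match("classes", stems))          # classe
print(strip_first_match("classes", longest_first))  # class
```

Sorting once up front keeps the early exit of the loop while making the outcome independent of how the stems list happens to be written.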
