Finding the original form of a word after stemming

I am generating a list of words from a text and building a DataFrame from it. The original data is as follows:

import pandas as pd
original = 'The man who flies the airplane dies in an air crash. His wife died a couple of weeks ago.'
df = pd.DataFrame({'text':[original]})

The functions I use for lemmatizing and stemming are:

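# Lemmatize, then stem.
import gensim
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# Assumption: the question never shows how `stemmer` is defined; a Snowball stemmer is used here.
stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result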

The output comes from running df['text'].map(preprocess)[0], which gives me:

['man',
'fli',
'airplan',
'die',
'air',
'crash',
'wife',
'die',
'coupl',
'week',
'ago']

How can I map the output back to the original tokens? For example, I have 'die', which comes from 'dies' and 'died'.

Stemming destroys information from the original corpus by irreversibly converting multiple tokens into a shared "stem" form.

If you want the original text, you need to keep it yourself.
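One way to keep the originals is to carry each surface token alongside its stem instead of replacing it. The following is a minimal sketch, not the asker's code: `toy_stem` and `preprocess_keep_original` are hypothetical names, and `toy_stem` is only a stand-in for the question's `lemmatize_stemming` (it strips a couple of suffixes and is not a real stemmer).

```python
def toy_stem(token):
    # Hypothetical stand-in for lemmatize_stemming(); handles this example only.
    for suffix in ("s", "d"):
        if token.endswith(suffix) and len(token) > 3:
            return token[: -len(suffix)]
    return token


def preprocess_keep_original(tokens, stem=toy_stem):
    # Return (original, stem) pairs so the surface form is never lost.
    return [(token, stem(token)) for token in tokens]


pairs = preprocess_keep_original(["dies", "died", "wife"])
print(pairs)
```

Because every stem stays paired with the token it came from, 'dies' and 'died' remain distinguishable even though both reduce to 'die'. In practice you would pass the real stemming function as `stem`.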

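But also note: many algorithms that work on large amounts of data, such as word2vec under ideal conditions, do not necessarily need stemming or even benefit from it. You want vectors for all the words in the original text, not just for the stems, and with enough data the related forms of a word will get similar vectors. (In fact, they will even differ in useful ways, with all the "past tense" or "adverbial" or other variants sharing a similar directional skew.)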

So stem only if you are sure it helps, given your goals and the limits of your corpus.

You can return the mapping as the result and do the post-processing later.

def preprocess(text):
    # Map each original token to its stem (a dict, not a list).
    lemma_mapping = {}
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma_mapping[token] = lemmatize_stemming(token)
    return lemma_mapping

Or store it as a by-product.

from collections import defaultdict

lemma_mapping = defaultdict(str)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma = lemmatize_stemming(token)
            result.append(lemma)
            lemma_mapping[token] = lemma
    return result
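The mapping above goes from original token to stem, while the question asks for the opposite direction. A possible post-processing step is to invert it, so that each stem lists every original token that produced it. This is a sketch: `invert_mapping` and `originals_by_stem` are hypothetical names, and the sample mapping is written by hand from the question's example rather than produced by running the stemmer.

```python
from collections import defaultdict


def invert_mapping(lemma_mapping):
    # Group original tokens under the stem they were reduced to.
    originals_by_stem = defaultdict(set)
    for token, stem in lemma_mapping.items():
        originals_by_stem[stem].add(token)
    return originals_by_stem


# Hand-written sample of what preprocess() would record for the question's text.
lemma_mapping = {"dies": "die", "died": "die", "flies": "fli", "weeks": "week"}
originals_by_stem = invert_mapping(lemma_mapping)
print(sorted(originals_by_stem["die"]))  # ['died', 'dies']
```

A set is used because one stem can have several originals and the same token may be seen many times. The inverse tells you which forms a stem could have come from, not which one appeared at a given position.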
