Python frequency of words using gensim: How to get the word instead of id word in corpus

Updated: 2023-09-20

I am using gensim to compute the frequency of words in a set of notes.

After running the following code:

from gensim import corpora
dictionary = corpora.Dictionary(sentences) 
corpus = [dictionary.doc2bow(text) for text in sentences]

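I get a corpus like this: [(0, 1), (1, 5), (3, 1), …]

I would like a corpus like this instead: [(word_1, 1), (word_2, 5), (word_3, 1), …]

In other words, I want the corpus to contain the words themselves rather than their ids.

Can anyone help me get this, and then save such a corpus as an Excel file?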

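According to the documentation, the word-to-id mapping is available in dictionary.token2id. For fast lookups, invert the key-value mapping of dictionary.token2id so that it goes from id to word, then apply a list comprehension: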

mapping = {v: k for k, v in dictionary.token2id.items()}
[(mapping[i[0]], i[1]) for i in corpus]
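
As a minimal illustration of the inversion on a single document's pairs, here is a sketch that runs without gensim. The `token2id` dict and its words are made up for the example and only stand in for `dictionary.token2id`; the pairs are the ones shown in the question.

```python
# Hypothetical stand-in for dictionary.token2id (word -> integer id).
token2id = {"apple": 0, "banana": 1, "cherry": 2, "date": 3}

# One document's bag-of-words pairs in (id, count) form, as in the question.
doc_bow = [(0, 1), (1, 5), (3, 1)]

# Invert the mapping once so every id lookup is a plain dict access.
mapping = {token_id: token for token, token_id in token2id.items()}

# Replace each id with its word and keep the count unchanged.
doc_words = [(mapping[token_id], count) for token_id, count in doc_bow]
print(doc_words)  # [('apple', 1), ('banana', 5), ('date', 1)]
```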

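However, because corpus is built with a list comprehension over your sample data, it will probably be a list of lists, with one list of (id, count) pairs per document. In that case, you can use: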

[[(mapping[i[0]], i[1]) for i in item] for item in corpus]
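
For the second part of the question (saving the result so Excel can open it), one possible approach is to flatten the nested corpus into rows and write them as CSV with the standard library. This is only a sketch under stated assumptions: `token2id` and `corpus` below are hand-written stand-ins for `dictionary.token2id` and the real corpus, and the helper names `corpus_to_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io

# Hypothetical stand-ins for dictionary.token2id and the nested corpus.
token2id = {"apple": 0, "banana": 1, "cherry": 2}
corpus = [[(0, 1), (1, 2)], [(1, 1), (2, 3)]]

# Invert word -> id into id -> word.
mapping = {token_id: token for token, token_id in token2id.items()}


def corpus_to_rows(corpus, mapping):
    """Flatten the corpus into (document index, word, count) rows."""
    return [
        (doc_index, mapping[token_id], count)
        for doc_index, doc in enumerate(corpus)
        for token_id, count in doc
    ]


def rows_to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(["document", "word", "count"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = corpus_to_rows(corpus, mapping)
csv_text = rows_to_csv(rows)
print(csv_text)
```

To save to disk, the same `csv.writer` calls can target a file opened with `open(path, "w", newline="", encoding="utf-8")` instead of the in-memory buffer. If a real .xlsx file is required, loading the rows into a pandas DataFrame and calling `DataFrame.to_excel` is another option, assuming pandas and an Excel writer engine such as openpyxl are installed.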
