Gensim question: creating a corpus from a dictionary

I am new to Gensim and am learning it by following the examples here: https://www.machinelearningplus.com/nlp/gensim-tutorial/

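I am unsure about the last line, which creates the corpus from the dictionary. When creating the dictionary, we already ran simple_preprocess over the documents line by line. When we then build the corpus from that dictionary, we run simple_preprocess over the same documents line by line again. Is that redundant?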

from gensim import corpora
from gensim.utils import simple_preprocess

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]
# Create the Dictionary and Corpus
mydict = corpora.Dictionary([simple_preprocess(line) for line in documents])
# Why do we need to call simple_preprocess on the documents again, when
# the previous call already built the dictionary using simple_preprocess on them?
corpus = [mydict.doc2bow(simple_preprocess(line)) for line in documents]

Thanks,

Alex

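The Dictionary object maps each word in the corpus to a unique id, while doc2bow() builds a bag-of-words (BoW) representation of one tokenized document using the dictionary you supply. The dictionary stores only the vocabulary, not the documents themselves, so doc2bow() still has to be given the tokens of each document.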

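In my opinion, scikit-learn's CountVectorizer is a better choice for a BoW model, because it takes useful parameters such as min_df and max_df directly, which doc2bow() does not offer. In Gensim the comparable filtering is a separate step, Dictionary.filter_extremes() with its no_below and no_above arguments.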
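For comparison, here is a rough sketch of the same three documents with CountVectorizer. The threshold values min_df=2 and max_df=0.9 are arbitrary ones I picked for illustration; note that CountVectorizer uses its own default tokenizer rather than simple_preprocess.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = ["This is the first line",
             "This is the second sentence",
             "This third document"]

# min_df=2   : drop terms that appear in fewer than 2 documents
# max_df=0.9 : drop terms that appear in more than 90% of the documents
vectorizer = CountVectorizer(min_df=2, max_df=0.9)

# fit_transform() learns the vocabulary and returns a sparse document-term matrix.
bow = vectorizer.fit_transform(documents)

print(sorted(vectorizer.vocabulary_))  # the terms that survived the filtering
print(bow.toarray())                   # rows = documents, columns = terms
```

With thresholds like these, words that occur in only one document and words that occur in every document are both removed before counting.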
