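I am trying to reuse Scikit-Learn vectorizer objects with Gensim topic models. The reasons are simple: first, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of Scikit-Learn vectorizers; third, even though topic modelling with Gensim is very fast, computing its dictionaries (Dictionary()) is relatively slow in my experience.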
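Similar questions have been asked before, especially here and here, and the bridging solution is Gensim's Sparse2Corpus() function, which converts a Scipy sparse matrix into a Gensim corpus object.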
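However, this conversion does not make use of the vocabulary_ attribute of sklearn vectorizers, which holds the mapping between words and feature ids. That mapping is needed to print the discriminant words for each topic (id2word in Gensim topic models, described as "a mapping from word ids (integers) to words (strings)").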
I know that Gensim's Dictionary objects are more complex than Scikit's vect.vocabulary_ (a simple Python dict)...
Any ideas on how to use vect.vocabulary_ as id2word in Gensim models?
Some example code:
# our data
documents = [u'Human machine interface for lab abc computer applications',
u'A survey of user opinion of computer system response time',
u'The EPS user interface management system',
u'System and human system engineering testing of EPS',
u'Relation of user perceived response time to error measurement',
u'The generation of random binary unordered trees',
u'The intersection graph of paths in trees',
u'Graph minors IV Widths of trees and well quasi ordering',
u'Graph minors A survey']
from sklearn.feature_extraction.text import CountVectorizer
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# each doc is a scipy sparse matrix
print(vect.vocabulary_)
#{u'and': 1, u'minors': 20, u'generation': 9, u'testing': 32, u'iv': 15, u'engineering': 5, u'computer': 4, u'relation': 28, u'human': 11, u'measurement': 19, u'unordered': 37, u'binary': 3, u'abc': 0, u'for': 8, u'ordering': 23, u'graph': 10, u'system': 31, u'machine': 17, u'to': 35, u'quasi': 26, u'time': 34, u'random': 27, u'paths': 24, u'of': 21, u'trees': 36, u'applications': 2, u'management': 18, u'lab': 16, u'interface': 13, u'intersection': 14, u'response': 29, u'perceived': 25, u'in': 12, u'widths': 40, u'well': 39, u'eps': 6, u'survey': 30, u'error': 7, u'opinion': 22, u'the': 33, u'user': 38}
import gensim
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
lsi = gensim.models.LsiModel(corpus_vect_gensim, num_topics=4)
# I instead would like something like this line below
# lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vect.vocabulary_, num_topics=2)
print(lsi.print_topics(2))
#['0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34" + 0.197*"36" + 0.170*"33" + 0.168*"1" + 0.158*"10" + 0.147*"4"', '0.399*"36" + 0.364*"10" + -0.295*"31" + 0.245*"20" + -0.226*"38" + 0.194*"26" + 0.194*"15" + 0.194*"39" + 0.194*"23" + 0.194*"40"']
Gensim does not require Dictionary objects. You can pass your plain dict directly as id2word, as long as it maps ids (integers) to words (strings).
In fact, anything dict-like will do (including dict, Dictionary, SqliteDict...).
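(By the way, Gensim's Dictionary is a simple Python dict underneath. I am not sure where your remarks on Dictionary performance come from; you cannot get a mapping much faster than a plain dict in Python. Maybe you are confusing it with text preprocessing (not part of Gensim), which can indeed be slow.)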
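Just to provide a final example: Scikit-learn's vectorized output can be converted into a Gensim corpus with Sparse2Corpus, while the vocabulary dict can be recycled by simply swapping keys and values: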
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
# transform scikit vocabulary into gensim dictionary
vocabulary_gensim = {}
for key, val in vect.vocabulary_.items():
    vocabulary_gensim[val] = key
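To see what the swapped mapping buys you, here is a minimal sketch that runs without Gensim and relabels the start of the first numeric topic string from the question with actual words. The helper name relabel_topic and the regular expression are my own illustration, not part of Gensim, and the vocabulary is a hand-copied subset of the vect.vocabulary_ output in the question.

```python
import re

# Subset of vect.vocabulary_ from the question (word -> feature id).
vocabulary = {'of': 21, 'system': 31, 'user': 38, 'response': 29, 'time': 34}

# Swap keys and values to get the id -> word mapping that id2word expects.
id2word = {idx: word for word, idx in vocabulary.items()}


def relabel_topic(topic, id2word):
    """Replace quoted numeric ids in a printed topic with their words.

    Hypothetical helper for illustration; ids missing from id2word are kept.
    """
    def lookup(match):
        term_id = int(match.group(1))
        return '"%s"' % id2word.get(term_id, match.group(1))
    return re.sub(r'"(\d+)"', lookup, topic)


topic = '0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34"'
print(relabel_topic(topic, id2word))
# 0.622*"of" + 0.359*"system" + 0.256*"user" + 0.206*"response" + 0.206*"time"
```

Passing the same id -> word dict as id2word should make the model print words directly, so a helper like this is only useful for output that was already produced without it.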
I am also running some code experiments using these two. Apparently there is now a way to construct the dictionary from a corpus:
from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary.from_corpus(corpus_vect_gensim,
id2word=dict((id, word) for word, id in vect.vocabulary_.items()))
You can then use this dictionary for TF-IDF, LSI, or LDA models.
A solution in working Python 3 code:
import gensim
from gensim.corpora.dictionary import Dictionary
from sklearn.feature_extraction.text import CountVectorizer
def vect2gensim(vectorizer, dtmatrix):
    # transform sparse matrix into gensim corpus and dictionary
    corpus_vect_gensim = gensim.matutils.Sparse2Corpus(dtmatrix, documents_columns=False)
    dictionary = Dictionary.from_corpus(corpus_vect_gensim,
        id2word=dict((id, word) for word, id in vectorizer.vocabulary_.items()))
    return (corpus_vect_gensim, dictionary)
documents = [u'Human machine interface for lab abc computer applications',
u'A survey of user opinion of computer system response time',
u'The EPS user interface management system',
u'System and human system engineering testing of EPS',
u'Relation of user perceived response time to error measurement',
u'The generation of random binary unordered trees',
u'The intersection graph of paths in trees',
u'Graph minors IV Widths of trees and well quasi ordering',
u'Graph minors A survey']
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# transport to gensim
(gensim_corpus, gensim_dict) = vect2gensim(vect, corpus_vect)
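Posting this as an answer since I do not have 50 reputation yet.
Using vect.vocabulary_ directly (with keys and values swapped) will not work on Python 3, because dict.keys() now returns an iterable view rather than a list. The associated error is: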
TypeError: can only concatenate list (not "dict_keys") to list
To make this work on Python 3, change line 301 in lsimodel.py to
self.num_terms = 1 + max([-1] + list(self.id2word.keys()))
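The failure is easy to reproduce without Gensim. This sketch assumes only that the original line concatenated [-1] with id2word.keys(), as the patched line suggests; the helper name num_terms is mine and merely mirrors the patched expression.

```python
# Plain id -> word mapping, as produced by swapping vect.vocabulary_.
id2word = {0: 'abc', 1: 'and', 2: 'applications'}

# On Python 3, dict.keys() is a view, so list concatenation raises TypeError.
try:
    [-1] + id2word.keys()
except TypeError as err:
    print(err)


def num_terms(id2word):
    """Number of terms implied by the largest id; 0 for an empty mapping."""
    # Wrapping the view in list() restores the Python 2 behaviour.
    return 1 + max([-1] + list(id2word.keys()))


print(num_terms(id2word))  # 3
print(num_terms({}))       # 0
```

The leading -1 is what makes an empty mapping yield zero terms instead of raising an error in max().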
Hope this helps.
The same example, with Scikit's tokenizer and stop words as the only difference:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
import gensim
from gensim import models
print("Text Similarity with Gensim and Scikit utils")
# compute vector space with sklearn
documents = [
"Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey",
]
# Using Scikit learn feature extractor
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), stop_words='english')
corpus_vect = vect.fit_transform(documents)
# take the dict keys out
texts = list(vect.vocabulary_.keys())
from gensim import corpora
dictionary = corpora.Dictionary([texts])
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
# create LSI model
lsi = models.LsiModel(corpus_vect_gensim, id2word=dictionary, num_topics=2)
# convert the query to LSI space
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_lsi = lsi[vec_bow]
print(vec_lsi)
# Find similarities
from gensim import similarities
index = similarities.MatrixSimilarity(lsi[corpus_vect_gensim]) # transform corpus to LSI space and index it
sims = index[vec_lsi] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
sims = sorted(enumerate(sims), key=lambda item: -item[1])
for doc_position, doc_score in sims:
    print(doc_score, documents[doc_position])
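One thing worth checking with this approach: the matrix columns follow vect.vocabulary_, while id2word here is a freshly built Dictionary, so topics and queries are only labelled correctly if both assign the same id to every word. Below is a minimal sketch of such a check, written against plain dicts so it runs without either library. The function name misaligned_terms is hypothetical; with the real objects you would pass vect.vocabulary_ and dictionary.token2id.

```python
def misaligned_terms(sklearn_vocab, gensim_token2id):
    """Return words whose ids differ between the two word -> id mappings.

    Words present in only one of the mappings are reported as well.
    """
    words = set(sklearn_vocab) | set(gensim_token2id)
    return sorted(
        word for word in words
        if sklearn_vocab.get(word) != gensim_token2id.get(word)
    )


# Toy mappings standing in for vect.vocabulary_ and dictionary.token2id.
vocab = {'graph': 0, 'minors': 1, 'survey': 2}
same_ids = {'graph': 0, 'minors': 1, 'survey': 2}
other_ids = {'survey': 0, 'graph': 1, 'minors': 2}

print(misaligned_terms(vocab, same_ids))   # []
print(misaligned_terms(vocab, other_ids))  # ['graph', 'minors', 'survey']
```

If the list is not empty, passing the swapped vect.vocabulary_ as id2word, as in the earlier answers, avoids the mismatch.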