I'm new to scikit-learn and I'm using TfidfVectorizer to find the tf-idf values of terms in a set of documents. I'm using the code below to do that.
vectorizer = TfidfVectorizer(stop_words=u'english',ngram_range=(1,5),lowercase=True)
X = vectorizer.fit_transform(lectures)
Now if I print X, I can see all the entries in the matrix, but how do I find the top n entries by tf-idf score? Beyond that, is there any way to find the top n entries for each ngram size separately, i.e. the top n among the unigrams, bigrams, trigrams, and so on?
As of version 0.15, the global term weights learned by a TfidfVectorizer can be accessed through the attribute idf_, which returns an array whose length equals the feature dimension. Sort the features by this weight to get the highest-weighted features:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)
indices = np.argsort(vectorizer.idf_)[::-1]  # feature indices, highest idf weight first
features = vectorizer.get_feature_names()  # use get_feature_names_out() on scikit-learn >= 1.0
top_n = 2
top_features = [features[i] for i in indices[:top_n]]
print(top_features)
Output: ['food', 'drink']
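Note that idf_ holds the corpus-level inverse document frequency weights, not the per-document tf-idf scores. If you instead want the top n terms of a single document ranked by the tf-idf values stored in X, a minimal sketch (reusing the toy lectures data from above; doc_index is just an illustrative choice) could look like this:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lectures)
features = vectorizer.get_feature_names()  # get_feature_names_out() on newer scikit-learn
doc_index = 0  # which document to inspect; arbitrary choice for this sketch
row = X[doc_index].toarray().ravel()  # dense tf-idf scores for that document
top_n = 2
top = np.argsort(row)[::-1][:top_n]  # indices of the highest-scoring terms
print([(features[i], row[i]) for i in top])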
The second question, getting the top features per ngram size, can be handled with the same idea, with a few extra steps to split the features into groups:
from sklearn.feature_extraction.text import TfidfVectorizer
from collections import defaultdict
lectures = ["this is some food", "this is some drink"]
vectorizer = TfidfVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(lectures)
features_by_gram = defaultdict(list)
for f, w in zip(vectorizer.get_feature_names(), vectorizer.idf_):
    features_by_gram[len(f.split(' '))].append((f, w))  # group features by number of tokens
top_n = 2
for gram, features in features_by_gram.items():
    top_features = sorted(features, key=lambda x: x[1], reverse=True)[:top_n]
    top_features = [f[0] for f in top_features]
    print('{}-gram top:'.format(gram), top_features)
Output:
1-gram top: ['drink', 'food']
2-gram top: ['some drink', 'some food']