I have this code for computing text similarity with TF-IDF:
from sklearn.feature_extraction.text import TfidfVectorizer
documents = [doc1,doc2]
tfidf = TfidfVectorizer().fit_transform(documents)
pairwise_similarity = tfidf * tfidf.T
print(pairwise_similarity.toarray())
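The problem is that this code takes plain strings as input, and I want to prepare the documents first by removing stop words, stemming, and tokenizing. Each document would then be a list of tokens rather than a string. If I build documents = [doc1,doc2] from the tokenized documents, I get this error: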
Traceback (most recent call last):
  File "C:\Users\tasos\Desktop\my thesis\beta\similarity.py", line 18, in <module>
    tfidf = TfidfVectorizer().fit_transform(documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 1219, in fit_transform
    X = super(TfidfVectorizer, self).fit_transform(raw_documents)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 780, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 715, in _count_vocab
    for feature in analyze(doc):
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 229, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Python27\lib\site-packages\scikit_learn-0.14.1-py2.7-win32.egg\sklearn\feature_extraction\text.py", line 195, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Is there a way to change the code so that it accepts lists, or do I have to convert the tokenized documents back into strings?
Try skipping the lowercasing step of the preprocessing and supplying your own "nop" tokenizer:
tfidf = TfidfVectorizer(tokenizer=lambda doc: doc, lowercase=False).fit_transform(documents)
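You should also check the other parameters, such as stop_words, so that you don't repeat preprocessing you have already done yourself.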
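For reference, here is a minimal end-to-end sketch of that suggestion. The token lists in doc1 and doc2 are made up for illustration and stand in for whatever your own stop-word removal and stemming step produces. The helper name identity is also just an illustrative choice. It does the same job as the lambda above, but a named function can be pickled if you later want to save the vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical pre-tokenized documents: assume stop words were removed
# and tokens were stemmed by an earlier step that is not shown here.
doc1 = ["cat", "sat", "mat"]
doc2 = ["dog", "sat", "log"]
documents = [doc1, doc2]


def identity(tokens):
    """Return the token list unchanged so the vectorizer does not re-tokenize it."""
    return tokens


vectorizer = TfidfVectorizer(
    tokenizer=identity,  # each document is already a list of tokens
    lowercase=False,     # skip .lower(), which a list does not have
)
tfidf = vectorizer.fit_transform(documents)

# Rows are L2-normalized by default, so the product of the matrix with its
# transpose holds the cosine similarity of every pair of documents.
pairwise_similarity = tfidf.dot(tfidf.T).toarray()
print(pairwise_similarity)
```

The diagonal of the result should be 1.0 (each document compared with itself), and the off-diagonal entry should fall strictly between 0 and 1 here, because the two example documents share only the token "sat". Since stop_words is left at its default of None, the vectorizer does no stop-word filtering of its own on top of yours. Recent scikit-learn versions may warn that token_pattern is unused when a tokenizer is given, and that warning is harmless here.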