Load one document at a time instead of keeping all the documents in memory.
I have two documents, doc1.txt and doc2.txt. Their contents are:
#doc1.txt
very good, very bad, you are great
#doc2.txt
very bad, good restaurent, nice place to visit
I want to split my corpus on ', ' so that my final DocumentTermMatrix becomes:
          terms
    docs  very good  very bad  you are great  good restaurent  nice place to visit
    doc1  tf-idf     tf-idf    tf-idf         0                0
    doc2  0          tf-idf    0              tf-idf           tf-idf
I know how to compute the DocumentTermMatrix of individual words (using http://scikit-learn.org/stable/modules/feature_extraction.html), but I don't know how to compute a DocumentTermMatrix of strings in Python.
You can pass a function as the analyzer parameter of TfidfVectorizer; it will be called on each document to extract features in a custom way:
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['very good, very bad, you are great',
            'very bad, good restaurent, nice place to visit']

    # Split each document on ', ' so multi-word phrases stay intact.
    tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)
    print(tfidf.get_feature_names())
The extracted features are:
['good restaurent', 'nice place to visit', 'very bad', 'very good', 'you are great']
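To get from the feature names to the matrix shown in the question, `transform()` returns the sparse document-term matrix of tf-idf weights; a minimal sketch using the same two documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['very good, very bad, you are great',
        'very bad, good restaurent, nice place to visit']

# Split each document on ', ' so multi-word phrases become single features.
tfidf = TfidfVectorizer(analyzer=lambda d: d.split(', ')).fit(docs)

# transform() yields the sparse document-term matrix of tf-idf weights;
# toarray() densifies it for inspection (fine for small corpora only).
matrix = tfidf.transform(docs)
print(matrix.toarray())
```

Each row corresponds to a document and each column to one of the phrase features, in the order reported by the vectorizer; entries for phrases absent from a document are 0, matching the layout sketched in the question.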
If you really cannot afford to load all the data into memory at once, here is a workaround: pass the filenames as the corpus and let the analyzer read each file itself, so only one file is open at a time:
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ['doc1.txt', 'doc2.txt']

    def extract(filename):
        # Read one file at a time so the whole corpus never sits in memory.
        features = []
        with open(filename) as f:
            for line in f:
                features += line.strip().split(', ')
        return features

    tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
    print(tfidf.get_feature_names())
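A runnable end-to-end sketch of this file-based approach; it first writes the two example files from the question purely for the demo, then fits and transforms while reading one file at a time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Write the two example documents to disk (demo setup only).
contents = {
    'doc1.txt': 'very good, very bad, you are great',
    'doc2.txt': 'very bad, good restaurent, nice place to visit',
}
for name, text in contents.items():
    with open(name, 'w') as f:
        f.write(text)

def extract(filename):
    # The analyzer receives the raw "document" (here, a filename),
    # opens it, and splits each line on ', '; only this one file
    # is held in memory at a time.
    features = []
    with open(filename) as f:
        for line in f:
            features += line.strip().split(', ')
    return features

docs = ['doc1.txt', 'doc2.txt']
tfidf = TfidfVectorizer(analyzer=extract).fit(docs)
matrix = tfidf.transform(docs)
print(matrix.shape)
```

Note that fit() and transform() each re-read the files through the analyzer, trading extra I/O for a smaller memory footprint.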