Can you add to a CountVectorizer in scikit-learn?



I'd like to create a CountVectorizer in scikit-learn from a corpus of text, and then later add more text to that CountVectorizer (extending the original vocabulary).

If I use transform(), it keeps the original vocabulary but does not add any new words. If I use fit_transform(), it simply rebuilds the vocabulary from scratch. See below:

In [2]: count_vect = CountVectorizer()
In [3]: count_vect.fit_transform(["This is a test"])
Out[3]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [4]: count_vect.vocabulary_  
Out[4]: {u'is': 0, u'test': 1, u'this': 2}
In [5]: count_vect.transform(["This not is a test"])
Out[5]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'test': 1, u'this': 2}
In [7]: count_vect.fit_transform(["This not is a test"])
Out[7]: 
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [8]: count_vect.vocabulary_
Out[8]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}

I'd like the equivalent of an update() function. I'd like it to work something like this:

In [2]: count_vect = CountVectorizer()
In [3]: count_vect.fit_transform(["This is a test"])
Out[3]: 
<1x3 sparse matrix of type '<type 'numpy.int64'>'
    with 3 stored elements in Compressed Sparse Row format>
In [4]: count_vect.vocabulary_  
Out[4]: {u'is': 0, u'test': 1, u'this': 2}
In [5]: count_vect.update(["This not is a test"])
Out[5]: 
<1x4 sparse matrix of type '<type 'numpy.int64'>'
    with 4 stored elements in Compressed Sparse Row format>
In [6]: count_vect.vocabulary_
Out[6]: {u'is': 0, u'not': 1, u'test': 2, u'this': 3}

Is there a way to do this?

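The algorithms implemented in scikit-learn are designed to be fit on all of the data at once. That is necessary for most ML algorithms (though, interestingly, not for the application you describe), so there is no update functionality.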

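There is a way to get what you want by thinking about it slightly differently, though: keep the documents you have seen so far and refit on the whole list whenever new text arrives. See the following code: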

from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
# Fit on the original corpus.
count_vect.fit_transform(["This is a test"])
print(count_vect.vocabulary_)
# Refit on the original corpus plus the new text to rebuild the vocabulary.
count_vect.fit_transform(["This is a test", "This is not a test"])
print(count_vect.vocabulary_)

which outputs:

{u'this': 2, u'test': 1, u'is': 0}
{u'this': 3, u'test': 2, u'is': 0, u'not': 1}
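If refitting on the full corpus is not convenient, one possible alternative is to extend the learned vocabulary by hand and pass it back in through CountVectorizer's `vocabulary` parameter. The sketch below is only an illustration under stated assumptions: `extend_vocabulary` is a hypothetical helper, not part of scikit-learn, and it assumes a reasonably recent scikit-learn running on Python 3. Unlike a refit, it appends new terms at the end, so existing column indices stay the same and the result is not in the alphabetical order shown in the question.

```python
from sklearn.feature_extraction.text import CountVectorizer


def extend_vocabulary(vectorizer, new_docs):
    """Return a new CountVectorizer whose vocabulary also covers new_docs.

    Hypothetical helper, not a scikit-learn API. Terms already in the
    vocabulary keep their column index; unseen terms are appended.
    """
    # Reuse the fitted vectorizer's own preprocessing and tokenization.
    analyzer = vectorizer.build_analyzer()
    vocab = {term: int(idx) for term, idx in vectorizer.vocabulary_.items()}
    for doc in new_docs:
        for term in analyzer(doc):
            if term not in vocab:
                vocab[term] = len(vocab)
    # Carry over the remaining settings (ngram_range, stop_words, ...).
    params = vectorizer.get_params()
    params["vocabulary"] = vocab
    return CountVectorizer(**params)


count_vect = CountVectorizer()
count_vect.fit(["This is a test"])

# Build a vectorizer with the extended, fixed vocabulary.
count_vect = extend_vocabulary(count_vect, ["This not is a test"])
counts = count_vect.transform(["This not is a test"])

print(count_vect.vocabulary_)
print(counts.shape)
```

Here "not" is assigned index 3, after the three original terms. Matrices produced before the update have fewer columns than those produced after it, so they would need to be padded with empty columns before being stacked together.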
