Partial fit for sklearn's CountVectorizer



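Does CountVectorizer support partial fit?

I would like to train the CountVectorizer using different batches of data.

No, it does not support partial fit.

But you can write a simple method to accomplish your goal: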

def partial_fit(self , data):
    if(hasattr(vectorizer , 'vocabulary_')):
        vocab = self.vocabulary_
    else:
        vocab = {}
    self.fit(data)
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_ )))
    self.vocabulary_ = {vocab[i] : i for i in range(len(vocab))}
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer.partial_fit = partial_fit
vectorizer = CountVectorizer(stop_words=l)
vectorizer.fit(df[15].values[0:100])
vectorizer.partial_fit(df[15].values[100:200])

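sajiad's implementation is correct, and I appreciate them sharing their solution. It can be made more flexible by changing the call to hasattr() so that it refers to self instead of vectorizer.

I've done this in the short reproducible example below, which shows the difference between using fit() and partial_fit():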

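def partial_fit(self, data):
    # Remember the vocabulary learned so far (empty on the first call)
    if hasattr(self, 'vocabulary_'):
        vocab = self.vocabulary_
    else:
        vocab = {}
    # fit() discards the old vocabulary and learns one from this batch only
    self.fit(data)
    # Merge old and new terms, then reassign the column indices
    vocab = list(set(vocab.keys()).union(set(self.vocabulary_)))
    self.vocabulary_ = {vocab[i]: i for i in range(len(vocab))}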
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer.partial_fit = partial_fit
vectorizer = CountVectorizer()
corpus = ['The quick brown fox',
'jumps over the lazy dog']
# Without partial fit
for i in corpus:
    vectorizer.fit([i])
print(vectorizer.get_feature_names())

['dog', 'jumps', 'lazy', 'over', 'the']

# With partial fit
for i in corpus:
    vectorizer.partial_fit([i])
print(vectorizer.get_feature_names())

['over', 'fox', 'lazy', 'quick', 'the', 'jumps', 'dog', 'brown']
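
One thing to watch for with the approach above: the merged vocabulary is rebuilt from a set, so the column index of a term can change from one call to the next. Below is a minimal sketch of just the merge step, written so that existing indices stay put. It assumes the vocabulary is a plain dict mapping each term to a column index (which is how vocabulary_ is stored) and that the existing indices run from 0 to n-1. The name merge_vocabulary is a hypothetical helper, not part of scikit-learn.

```python
def merge_vocabulary(old_vocab, new_terms):
    """Return a copy of old_vocab extended with any unseen terms.

    Assumes old_vocab maps term -> column index with indices 0..n-1.
    Existing terms keep their index; unseen terms are appended in
    sorted order so the result is deterministic.
    """
    merged = dict(old_vocab)
    for term in sorted(set(new_terms) - set(old_vocab)):
        merged[term] = len(merged)  # next free column index
    return merged


# Vocabularies as they would look after fitting each sentence on its own
batch1 = {"brown": 0, "fox": 1, "quick": 2, "the": 3}
batch2 = {"dog": 0, "jumps": 1, "lazy": 2, "over": 3, "the": 4}

merged = merge_vocabulary(batch1, batch2)
print(merged)
```

Inside partial_fit, the last two lines could then be replaced with something like self.vocabulary_ = merge_vocabulary(vocab, self.vocabulary_). I have not checked this against every scikit-learn version, so treat it as a starting point.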
