For DictVectorizer, the object can be subset using the restrict() method. Below is an example where I explicitly list the features to keep using a boolean array.
from sklearn.feature_extraction import DictVectorizer
import numpy as np

v = DictVectorizer()
D = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X = v.fit_transform(D)
v.get_feature_names()
>> ['bar', 'baz', 'foo']

# Keep only the features flagged True in the boolean support mask
user_list = np.array([False, False, True], dtype=bool)
v.restrict(user_list)
v.get_feature_names()
>> ['foo']
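For reference, restrict() also accepts integer column indices when called with indices=True; the snippet below is a minimal sketch of the same subsetting done that way, on a freshly fitted vectorizer:

from sklearn.feature_extraction import DictVectorizer
import numpy as np

v2 = DictVectorizer()
v2.fit_transform([{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}])

# Index 2 corresponds to 'foo' in the sorted feature list ['bar', 'baz', 'foo']
v2.restrict(np.array([2]), indices=True)
v2.get_feature_names()
>> ['foo']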
I would like the same capability for a non-normalized CountVectorizer object. I have not found any way to slice the numpy objects coming out of a CountVectorizer, because there are many dependent attributes. The reason I am interested is that this would remove the need to repeatedly fit and transform the text data when features are simply being dropped after the first fit and transform. Is there an equivalent method I am missing, or can a custom method easily be created for CountVectorizer?
Update based on @Vivek's response
This approach seems to work. Here is the code I ran directly in a Python session to implement it.
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import copy as cp
v = CountVectorizer()
D = ['Data science is about the data', 'The science is amazing', 'Predictive modeling is part of data science']
v.fit_transform(D)
print(v.get_feature_names())
print(len(v.get_feature_names()))
>> ['about', 'amazing', 'data', 'is', 'modeling', 'of', 'part', 'predictive', 'science', 'the']
>> 10
user_list = np.array([False, False, True, False, False, True, False, False, True, False], dtype=bool)
new_vocab = {}
# Build a new vocabulary containing only the selected features, re-indexed from 0
for i in np.where(user_list)[0]:
    print(v.get_feature_names()[i])
    new_vocab[v.get_feature_names()[i]] = len(new_vocab)
new_vocab
>> data
>> of
>> science
>> {'data': 0, 'of': 1, 'science': 2}
v_copy = cp.deepcopy(v)
v_copy.vocabulary_ = new_vocab
print(v_copy.vocabulary_)
print(v_copy.get_feature_names())
v_copy
>> {'data': 0, 'of': 1, 'science': 2}
>> ['data', 'of', 'science']
>> CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words=None,
strip_accents=None, token_pattern='(?u)\b\w\w+\b',
tokenizer=None, vocabulary=None)
v_copy.transform(D).toarray()
>> array([[2, 0, 1],
[0, 0, 1],
[1, 1, 1]], dtype=int64)
Thanks @Vivek! This appears to work as expected for a non-normalized CountVectorizer object.
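For convenience, the steps above can be wrapped into a small restrict-style helper for CountVectorizer. This is a minimal sketch under the same assumptions as the update (a fitted, non-normalized vectorizer and a boolean support mask over the sorted feature names); restrict_count_vectorizer is a hypothetical helper name, not part of scikit-learn, and v and user_list are reused from the session above.

import copy as cp
import numpy as np

def restrict_count_vectorizer(vectorizer, support):
    # Hypothetical helper: return a deep copy of a fitted CountVectorizer
    # whose vocabulary keeps only the features where `support` is True.
    feature_names = vectorizer.get_feature_names()
    restricted = cp.deepcopy(vectorizer)
    restricted.vocabulary_ = {
        feature_names[i]: new_index
        for new_index, i in enumerate(np.where(support)[0])
    }
    return restricted

v_restricted = restrict_count_vectorizer(v, user_list)
v_restricted.get_feature_names()
>> ['data', 'of', 'science']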
Answer implementing @Vivek's suggestion, which was left as a comment on the original question; the code is identical to the update above.
You can assign or restrict the vocabulary of one vectorizer to another vectorizer as follows:
from sklearn.feature_extraction.text import CountVectorizer

# list_of_strings1 and list_of_strings2 are placeholder corpora (lists of documents)
count_vect1 = CountVectorizer()
count_vect1.fit(list_of_strings1)

# Reuse the vocabulary learned by the first vectorizer in the second one
count_vect2 = CountVectorizer(vocabulary=count_vect1.vocabulary_)
count_vect2.fit(list_of_strings2)
Answer adapted from: ValueError: dimension mismatch
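As a quick check, here is a minimal sketch with two assumed example corpora showing that the second vectorizer produces count matrices whose columns line up with the first vectorizer's features:

from sklearn.feature_extraction.text import CountVectorizer

# Example corpora assumed purely for illustration
list_of_strings1 = ['data science is fun', 'science is amazing']
list_of_strings2 = ['fun with data', 'more science']

count_vect1 = CountVectorizer()
count_vect1.fit(list_of_strings1)

count_vect2 = CountVectorizer(vocabulary=count_vect1.vocabulary_)
X2 = count_vect2.fit_transform(list_of_strings2)

# Columns of X2 follow count_vect1's feature order
print(count_vect1.get_feature_names())
print(X2.toarray())
>> ['amazing', 'data', 'fun', 'is', 'science']
>> [[0 1 1 0 0]
>>  [0 0 0 0 1]]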