CountVectorizer: "I" not showing up in vectorized text

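I am new to scikit-learn and am currently studying (multinomial) Naive Bayes. I am vectorizing text with CountVectorizer from sklearn.feature_extraction.text, and for some reason the word "I" does not show up in the output array when I vectorize some text.

Code: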

from sklearn.feature_extraction.text import CountVectorizer

x_train = ['I am a Nigerian hacker', 'I like puppies']
# convert x_train to vectorized text
vectorizer_train = CountVectorizer(min_df=0)
vectorizer_train.fit(x_train)
x_train_array = vectorizer_train.transform(x_train).toarray()
# print vectorized text, feature names
print(x_train_array)
print(vectorizer_train.get_feature_names())

Output:

[[1 1 0 1 0]
 [0 0 1 0 1]]
[u'am', u'hacker', u'like', u'nigerian', u'puppies']

Why does "I" not appear in the feature names? When I change it to "Ia" or something similar, it does show up.

This is caused by CountVectorizer's default token_pattern, which drops single-character tokens:

>>> vectorizer_train
CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
>>> import re
>>> pattern = re.compile(vectorizer_train.token_pattern, re.UNICODE)
>>> print(pattern.match("I"))
None
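The same behavior can be reproduced with the standard re module alone. The following is a minimal sketch, not scikit-learn's implementation: the helper name tokenize is hypothetical, and the lowercasing step only mirrors the lowercase=True default shown in the repr above.

```python
import re

# Default token_pattern from the repr above: a token needs at least
# two consecutive word characters.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Lowercase the text and return every match of the default pattern."""
    return DEFAULT_PATTERN.findall(text.lower())


print(tokenize("I am a Nigerian hacker"))  # ['am', 'nigerian', 'hacker']
print(tokenize("I like puppies"))          # ['like', 'puppies']
print(tokenize("Ia like puppies"))         # ['ia', 'like', 'puppies']
```

Both "I" and "a" are a single word character long, so the pattern skips them, while "Ia" has two characters and is kept.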

To keep "I", use a different pattern, e.g.

>>> vectorizer_train = CountVectorizer(min_df=0, token_pattern=r"\b\w+\b")
>>> vectorizer_train.fit(x_train)
CountVectorizer(analyzer=u'word', binary=False, charset=None,
        charset_error=None, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=0,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern='\\b\\w+\\b', tokenizer=None,
        vocabulary=None)
>>> vectorizer_train.get_feature_names()
[u'a', u'am', u'hacker', u'i', u'like', u'nigerian', u'puppies']

Note that the uninformative word "a" is now kept as well.
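One way to keep "i" while still dropping "a" is to combine the relaxed pattern with an explicit stop list. The sketch below illustrates the idea with the standard library only; it is not how CountVectorizer is implemented, and the names STOP_WORDS, tokenize and count_matrix are hypothetical.

```python
import re
from collections import Counter

# Relaxed pattern: a single word character is enough to form a token.
TOKEN_PATTERN = re.compile(r"(?u)\b\w+\b")
# Hypothetical stop list for this example only.
STOP_WORDS = {"a"}


def tokenize(text):
    """Lowercase, split into word tokens, and drop stop words."""
    return [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]


def count_matrix(docs):
    """Return the sorted vocabulary and one row of term counts per document."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


x_train = ['I am a Nigerian hacker', 'I like puppies']
vocabulary, rows = count_matrix(x_train)
print(vocabulary)  # ['am', 'hacker', 'i', 'like', 'nigerian', 'puppies']
print(rows)        # [[1, 1, 1, 0, 1, 0], [0, 0, 1, 1, 0, 1]]
```

With CountVectorizer itself, the stop_words parameter accepts a list of words, so passing stop_words=['a'] together with the relaxed token_pattern should have a similar effect.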

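Also note that CountVectorizer lowercases text by default (lowercase=True), which is why the token shows up as "i" rather than "I" once it is kept. To preserve case, use

vectorizer_train = CountVectorizer(min_df=0, lowercase=False)

On its own this does not bring back "I", because the default token_pattern still drops single-character tokens.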
