Eliminating unigrams from character-level TF-IDF

For an NLP problem I want to use the scikit-learn TF-IDF vectorizer to extract unique combinations of letters within words. However, I am not interested in single letters, only in combinations of letters, so for example "the" should yield "th" and "he", but not "t", "h", or "e". My understanding is that I should be able to control this with ngram_range. However, using ngram_range=(2, 3) still returns unigrams.

Example:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

examples = ['The cat on the mat',
            'Fast and bulbous']
tfidf = TfidfVectorizer(max_features=None,
                        analyzer='char_wb',
                        ngram_range=(2, 3))
data = tfidf.fit_transform(examples)
print(pd.DataFrame(data=data.todense(),
                   index=examples,
                   columns=tfidf.get_feature_names_out()))

This gives me the expected 2- and 3-grams, but it also gives me unigrams (i.e. "a", "b", etc., which I don't want):

                           a        an         b        bu         c
The cat on the mat  0.000000  0.000000  0.000000  0.000000  0.139994
Fast and bulbous    0.181053  0.181053  0.181053  0.181053  0.000000

                          ca         f        fa         m        ma  ...
The cat on the mat  0.139994  0.000000  0.000000  0.139994  0.139994  ...
Fast and bulbous    0.000000  0.181053  0.181053  0.000000  0.000000  ...

                           s        st        st         t        th
The cat on the mat  0.000000  0.000000  0.000000  0.199213  0.279987
Fast and bulbous    0.181053  0.181053  0.181053  0.128821  0.000000

                         the        ul       ulb        us        us
The cat on the mat  0.279987  0.000000  0.000000  0.000000  0.000000
Fast and bulbous    0.000000  0.181053  0.181053  0.181053  0.181053

[2 rows x 53 columns]

I would have expected this output from ngram_range=(1, 3), but not from ngram_range=(2, 3).


Edit: I just noticed that "a" is extracted from "Fast and bulbous", presumably because it occurs there as " a", i.e. at the start of "and", but not from "The cat on the mat", where the "a" in "cat" (and "mat") is surrounded by "c" and "t". Similarly, "u" is not extracted because it is not next to a space in either text.

So it seems that TfidfVectorizer is extracting bigrams that include a space. Is there a way to turn that off? (Especially since I am using analyzer='char_wb' to look within words rather than across words.)
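A quick way to confirm this is to call the analyzer directly: build_analyzer() returns the callable that the vectorizer applies to each document. With analyzer='char_wb' every word is padded with spaces, so the apparent unigrams are really space-padded bigrams. A sketch of the check (probe is just an illustrative name):

probe = TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 3))
# build_analyzer() returns the preprocessing + tokenization callable,
# so we can see the raw n-grams before any counting or TF-IDF weighting.
print(probe.build_analyzer()('and'))
# Should print: [' a', 'an', 'nd', 'd ', ' an', 'and', 'nd ']
# The word is padded to " and ", so " a" and "d " appear as 2-grams even
# though they render like unigrams in the DataFrame column labels.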

I constructed a callable to pass as the analyzer. It is adapted ("stolen") from the function the TfidfVectorizer source code uses when the analyzer is set to 'char_wb':

def char_wb_ngrams(text_document, ngram_range):
    """Callable for TfidfVectorizer analyzer, based on _char_wb_ngrams from the TfidfVectorizer source code at
    https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/feature_extraction/text.py"""
    ngrams = []
    min_n, max_n = ngram_range
    for w in text_document.lower().split():
        # This line in _char_wb_ngrams pads words with spaces and needs to be removed:
        # w = " " + w + " "
        w_len = len(w)
        for n in range(min_n, max_n + 1):
            offset = 0
            ngrams.append(w[offset : offset + n])
            while offset + n < w_len:
                offset += 1
                ngrams.append(w[offset : offset + n])
            if offset == 0:  # count a short word (w_len < n) only once
                break
    return ngrams
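As a quick sanity check, calling the function directly on one of the example sentences should now return only within-word 2- and 3-grams, with no space-padded tokens:

print(char_wb_ngrams('The cat on the mat', (2, 3)))
# Should print: ['th', 'he', 'the', 'ca', 'at', 'cat', 'on', 'th', 'he', 'the', 'ma', 'at', 'mat']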

This works for the example data above:

from functools import partial

tfidf_no_space = TfidfVectorizer(max_features=None,
                                 analyzer=partial(char_wb_ngrams, ngram_range=(2, 3)),
                                 ngram_range=(2, 3))
data = tfidf_no_space.fit_transform(examples)
print(pd.DataFrame(data=data.todense(),
                   index=examples,
                   columns=tfidf_no_space.get_feature_names_out()))

This produces:

                          an       and        as       ast        at
The cat on the mat  0.000000  0.000000  0.000000  0.000000  0.436436
Fast and bulbous    0.229416  0.229416  0.229416  0.229416  0.000000

                          bo       bou        bu       bul        ca  ...
The cat on the mat  0.000000  0.000000  0.000000  0.000000  0.218218  ...
Fast and bulbous    0.229416  0.229416  0.229416  0.229416  0.000000  ...

                          nd        on        ou       ous        st
The cat on the mat  0.000000  0.218218  0.000000  0.000000  0.000000
Fast and bulbous    0.229416  0.000000  0.229416  0.229416  0.229416

                          th       the        ul       ulb        us
The cat on the mat  0.436436  0.436436  0.000000  0.000000  0.000000
Fast and bulbous    0.000000  0.000000  0.229416  0.229416  0.229416

[2 rows x 28 columns]

I'm not sure, though, whether this handles punctuation properly. It would be nice to have a version that strips punctuation and that doesn't need the call to partial (which is only there to fix ngram_range in the function).
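For completeness, here is a minimal sketch of one way to cover both points, assuming a small factory function (make_char_wb_analyzer is just an illustrative name) that bakes ngram_range into a closure and replaces punctuation with spaces via a simple regex before splitting; it is a sketch, not a tested drop-in replacement:

import re
from sklearn.feature_extraction.text import TfidfVectorizer

def make_char_wb_analyzer(ngram_range=(2, 3)):
    """Return an analyzer callable with ngram_range baked in (no partial needed)."""
    min_n, max_n = ngram_range

    def analyzer(text_document):
        # Replace anything that is not a word character or whitespace with a space,
        # so punctuation never ends up inside an n-gram.
        cleaned = re.sub(r"[^\w\s]", " ", text_document.lower())
        ngrams = []
        for w in cleaned.split():
            w_len = len(w)
            for n in range(min_n, max_n + 1):
                offset = 0
                ngrams.append(w[offset : offset + n])
                while offset + n < w_len:
                    offset += 1
                    ngrams.append(w[offset : offset + n])
                if offset == 0:  # short word (w_len < n): count it only once
                    break
        return ngrams

    return analyzer

tfidf_clean = TfidfVectorizer(max_features=None,
                              analyzer=make_char_wb_analyzer((2, 3)))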
