I'm trying to use one of the example regexp patterns from the NLTK book as the token pattern for scikit-learn's CountVectorizer. I've seen examples with simple regexps, but nothing like this one:
pattern = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z]\.)+              # abbreviations (e.g. U.S.A.)
  | \w+(-\w+)*              # words with optional internal hyphens
  | \$?\d+(\.\d+)?%?        # currency & percentages
  | \.\.\.                  # ellipses '''
from sklearn.feature_extraction.text import CountVectorizer

text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
This produces:
[(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'', u''),
(u'', u'-ridden', u''),
(u'', u'', u''),
(u'', u'', u'')]
With nltk, I get something completely different:
nltk.regexp_tokenize(text, pattern)
['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']
Is there any way to get scikit-learn's CountVectorizer to output the same thing? I'd like to keep using some of the other handy features that are bundled into the same function call.
TL;DR
from functools import partial
from nltk.tokenize import regexp_tokenize

CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
is a vectorizer that uses the NLTK tokenizer.
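(One caveat, going by how CountVectorizer is documented to behave: when analyzer is a callable, the built-in preprocessing is bypassed entirely, so the lowercasing and stop_words='english' options from the original call no longer have any effect; the tokenizer receives the raw text.)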
Now for the actual problem: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply does an re.findall with the pattern you give it, and findall doesn't like this pattern:
In [33]: re.findall(pattern, text)
Out[33]:
[('', '', ''),
('', '', ''),
('C.', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '', ''),
('', '-ridden', ''),
('', '', ''),
('', '', '')]
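The tuples come from a documented quirk of re.findall: when the pattern contains capturing groups, findall returns the group contents instead of the whole match (NLTK sidesteps this by rewriting the groups internally). A quick illustration of the difference, using a toy pattern of my own:

In [34]: re.findall(r'(ab)+', 'ababab')    # capturing group: only the last repetition is returned
Out[34]: ['ab']

In [35]: re.findall(r'(?:ab)+', 'ababab')  # non-capturing group: the whole match is returned
Out[35]: ['ababab']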
You'd have to rewrite this pattern to make it work scikit-learn style, or plug the NLTK tokenizer into scikit-learn:
In [41]: from functools import partial
In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
In [43]: v.build_analyzer()(text)
Out[43]:
['I',
'love',
'N.Y.C.',
'100',
'even',
'with',
'all',
'of',
'its',
'traffic-ridden',
'streets',
'...']
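For completeness, here is a sketch of the other option, rewriting the pattern: make every group non-capturing so that findall returns whole matches, and disable lowercasing so the [A-Z] branch can still fire (pattern_nc is just an illustrative name, and I've left out stop_words, which would filter some of these tokens):

pattern_nc = r'''(?x)       # verbose regexps
    (?:[A-Z]\.)+            # abbreviations (e.g. U.S.A.)
  | \w+(?:-\w+)*            # words with optional internal hyphens
  | \$?\d+(?:\.\d+)?%?      # currency & percentages
  | \.\.\.                  # ellipses '''

v = CountVectorizer(token_pattern=pattern_nc, lowercase=False)
v.build_analyzer()(text)
# expected: the same token list as the NLTK output above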