Using an nltk regular expression example with scikit-learn's CountVectorizer



I am trying to use an example from the NLTK book as the regular expression pattern for scikit-learn's CountVectorizer. I've seen examples with simple regular expressions, but nothing like this:

from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)          # set flag to allow verbose regexps
    ([A-Z]\.)+              # abbreviations (e.g. U.S.A.)
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency & percentages
    | \.\.\.                # ellipses
'''
text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
vectorizer = CountVectorizer(stop_words='english', token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)

This produces:

[(u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'', u''),
 (u'', u'-ridden', u''),
 (u'', u'', u''),
 (u'', u'', u'')]

With nltk, I get something completely different:

nltk.regexp_tokenize(text, pattern)

['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its', 'traffic-ridden', 'streets', '...']

Is there a way to get the scikit-learn CountVectorizer to produce the same output? I'd like to keep using some of the other convenient features that are bundled into the same function call.

TL;DR

from functools import partial
from nltk import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer
CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))

is a vectorizer that uses the NLTK tokenizer.
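
To see it in action end to end, here is a minimal sketch; the two-document corpus is invented for illustration, and pattern is the verbose regexp from the question:

from functools import partial
from nltk import regexp_tokenize
from sklearn.feature_extraction.text import CountVectorizer

pattern = r'''(?x)          # verbose regexps
    ([A-Z]\.)+              # abbreviations (e.g. U.S.A.)
    | \w+(-\w+)*            # words with optional internal hyphens
    | \$?\d+(\.\d+)?%?      # currency & percentages
    | \.\.\.                # ellipses
'''

# toy corpus, invented for illustration
corpus = [
    'I love N.Y.C. 100% even with all of its traffic-ridden streets...',
    'I love its streets too.',
]

v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
X = v.fit_transform(corpus)   # sparse document-term count matrix
print(sorted(v.vocabulary_))  # the tokens exactly as NLTK produced them

One caveat: when analyzer is a callable, CountVectorizer hands each decoded document straight to it, so its own lowercasing, token_pattern and stop_words options no longer apply; stop-word filtering would have to happen inside the analyzer instead.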

Now for the actual question: apparently nltk.regexp_tokenize does something quite special with its pattern, whereas scikit-learn simply does a re.findall with the pattern you give it, and findall does not like this pattern:

In [33]: re.findall(pattern, text)
Out[33]: 
[('', '', ''),
 ('', '', ''),
 ('C.', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '', ''),
 ('', '-ridden', ''),
 ('', '', ''),
 ('', '', '')]
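
The empty tuples are standard re.findall behavior: when a pattern contains capturing groups, findall returns the group contents instead of the whole match, and a repeated group only keeps its last repetition. A tiny demonstration with a made-up pattern:

import re

re.findall(r'(ab)+', 'ababab')    # ['ab']     -- capturing: last repetition of the group
re.findall(r'(?:ab)+', 'ababab')  # ['ababab'] -- non-capturing: the whole match

NLTK's RegexpTokenizer, for its part, appears to convert capturing groups to non-capturing ones before matching, which is why it returns whole tokens.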

You have to either rewrite this pattern to make it work scikit-learn-style, or plug the NLTK tokenizer into scikit-learn:

In [41]: from functools import partial
In [42]: v = CountVectorizer(analyzer=partial(regexp_tokenize, pattern=pattern))
In [43]: v.build_analyzer()(text)
Out[43]: 
['I',
 'love',
 'N.Y.C.',
 '100',
 'even',
 'with',
 'all',
 'of',
 'its',
 'traffic-ridden',
 'streets',
 '...']
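
For the first option, the rewrite amounts to turning every capturing group into a non-capturing (?:...) group so that findall returns whole matches. A sketch, with pattern_nc as my name for the rewritten pattern:

import re

pattern_nc = r'''(?x)           # verbose regexps
    (?:[A-Z]\.)+                # abbreviations (e.g. U.S.A.)
    | \w+(?:-\w+)*              # words with optional internal hyphens
    | \$?\d+(?:\.\d+)?%?        # currency & percentages
    | \.\.\.                    # ellipses
'''

text = 'I love N.Y.C. 100% even with all of its traffic-ridden streets...'
re.findall(pattern_nc, text)
# ['I', 'love', 'N.Y.C.', '100', 'even', 'with', 'all', 'of', 'its',
#  'traffic-ridden', 'streets', '...']

Note that '100' loses its '%' just as it does with NLTK, because the word branch of the alternation is tried before the currency branch. And if you pass this as token_pattern, remember that CountVectorizer lowercases its input by default, so [A-Z] will never match unless you also set lowercase=False.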
