How to initialize a `Doc` in textacy 0.6.2

Trying the simple Doc initialization from the documentation doesn't work in Python 2:

>>> import textacy
>>> content = '''
...     The apparent symmetry between the quark and lepton families of
...     the Standard Model (SM) are, at the very least, suggestive of
...     a more fundamental relationship between them. In some Beyond the
...     Standard Model theories, such interactions are mediated by
...     leptoquarks (LQs): hypothetical color-triplet bosons with both
...     lepton and baryon number and fractional electric charge.'''
>>> metadata = {
...     'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
...     'author': 'Burton DeWilde',
...     'pub_date': '2012-08-01'}
>>> doc = textacy.Doc(content, metadata=metadata)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 120, in __init__
    {compat.unicode_, SpacyDoc}, type(content)))
ValueError: `Doc` must be initialized with set([<type 'unicode'>, <type 'spacy.tokens.doc.Doc'>]) content, not "<type 'str'>"

What should a simple initialization look like for a string or a sequence of strings?

Update:

Passing unicode(content) to textacy.Doc() spits out

ImportError: 'cld2-cffi' must be installed to use textacy's automatic language detection; you may do so via 'pip install cld2-cffi' or 'pip install textacy[lang]'.

It would be nice for that dependency to be there from the moment textacy is installed, imo.

Even after installing cld2-cffi, trying the code above throws

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 114, in __init__
    self._init_from_text(content, metadata, lang)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/doc.py", line 136, in _init_from_text
    spacy_lang = cache.load_spacy(langstr)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/cachetools/__init__.py", line 46, in wrapper
    v = func(*args, **kwargs)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/textacy/cache.py", line 99, in load_spacy
    return spacy.load(name, disable=disable)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/__init__.py", line 21, in load
    return util.load_model(name, **overrides)
  File "/Users/a/anaconda/envs/env1/lib/python2.7/site-packages/spacy/util.py", line 120, in load_model
    raise IOError("Can't find model '%s'" % name)
IOError: Can't find model 'en'

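As the traceback shows, the problem appears to be in the _init_from_text() function in textacy/doc.py, which tries to detect the language and then calls cache.load_spacy() with the string 'en' on line 136. (The spacy repo touches on this in this issue comment.)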

I solved this by passing a valid lang value, the unicode string u'en_core_web_sm', and by using unicode strings for both the content and lang arguments.

import textacy
content = u'''
The apparent symmetry between the quark and lepton families of
the Standard Model (SM) are, at the very least, suggestive of
a more fundamental relationship between them. In some Beyond the
Standard Model theories, such interactions are mediated by
leptoquarks (LQs): hypothetical color-triplet bosons with both
lepton and baryon number and fractional electric charge.'''
metadata = {
    'title': 'A Search for 2nd-generation Leptoquarks at √s = 7 TeV',
    'author': 'Burton DeWilde',
    'pub_date': '2012-08-01'}
doc = textacy.Doc(content, metadata=metadata, lang=u'en_core_web_sm')
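The IOError in the traceback above is raised when spacy.load() cannot find a model under the name it was given. Here is a minimal sketch for checking a model name before passing it as lang, assuming the model was installed as a regular pip package (as en_core_web_sm normally is). The helper name model_installed is hypothetical and is not part of textacy or spacy, and it only sees importable packages, so it will not detect shortcut links such as 'en'.

```python
import importlib


def model_installed(name):
    """Return True if `name` can be imported as a Python package.

    Assumption: the spacy model was installed via pip, so it is importable
    under its full name (for example 'en_core_web_sm').
    """
    try:
        importlib.import_module(name)
    except ImportError:
        # Not importable: either not installed, or only a shortcut link.
        return False
    return True
```

Calling model_installed('en_core_web_sm') before building the Doc gives an early, readable failure instead of a traceback from deep inside textacy/cache.py.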

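The fact that passing a str instead of a unicode string changes the behavior (with a cryptic error message), the missing package, and the possibly outdated or incomplete handling of spacy language strings all seem like bugs to me. 🤷‍♂️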

You are using Python 2 and getting a unicode error. The textacy docs include a note about the unicode nuances when using Python 2:

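Note: In almost all cases, textacy (as well as spacy) expects to be working with unicode text data. Throughout the code, this is indicated as str to be consistent with Python 3's default string type; users of Python 2, however, must be mindful to use unicode, and convert from the default (bytes) string type as needed.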

So I would give this a shot (note the u'''):

content = u'''
The apparent symmetry between the quark and lepton families of
the Standard Model (SM) are, at the very least, suggestive of
a more fundamental relationship between them. In some Beyond the
Standard Model theories, such interactions are mediated by
leptoquarks (LQs): hypothetical color-triplet bosons with both
lepton and baryon number and fractional electric charge.'''

This produces a Doc object as expected for me (on Python 3, though).
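When the text does not come from a literal (for example, it was read from a file as bytes), the u prefix is not an option. A minimal sketch of the "convert as needed" step from the note, written to behave the same on Python 2 and Python 3. The helper name to_text is hypothetical and not part of textacy, and the UTF-8 default is an assumption about how the input is encoded.

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a unicode text string, decoding bytes if needed."""
    if isinstance(value, bytes):
        # On Python 2, `bytes` is an alias for `str`, so plain '...' literals
        # are decoded here; on Python 3 only real bytes objects are.
        return value.decode(encoding)
    # Already text (unicode on Python 2, str on Python 3): return unchanged.
    return value
```

Unlike a bare unicode(content) call on Python 2, which decodes with the ASCII codec by default, this decodes with an explicit encoding, so byte strings containing characters such as √ do not raise UnicodeDecodeError.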
