Multiple spaCy Doc objects that I want to merge into one object



I'm trying to tokenize a text file of the King James Bible, but I hit a memory error when I try, so I split the text into multiple objects. Now I want to tokenize each of them with spaCy and then recombine them into a single Doc object. I've seen other people discussing a similar problem where they convert the docs to arrays and then back to a Doc after combining the arrays. Will that solve my problem, or will it just create new problems later?

I tried running it directly, but neither Colab nor my own computer has enough RAM to support it.

import re
import spacy
from nltk.corpus import gutenberg

nlp_spacy = spacy.load('en')
kjv_bible = gutenberg.raw('bible-kjv.txt')

# pattern for bracketed text titles
bracks = "[[].*?[]]"
kjv_bible = re.sub(bracks, "", kjv_bible)

# collapse all whitespace to single spaces
kjv_bible = ' '.join(kjv_bible.split())
len(kjv_bible)

kjv_bible_doc = nlp_spacy(kjv_bible)
ValueError                                Traceback (most recent call last)
<ipython-input-19-385936fadd40> in <module>()
----> 1 kjv_bible_doc = nlp_spacy(kjv_bible)

/usr/local/lib/python3.6/dist-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    378         if len(text) > self.max_length:
    379             raise ValueError(
--> 380                 Errors.E088.format(length=len(text), max_length=self.max_length)
    381             )
    382         doc = self.make_doc(text)

ValueError: [E088] Text of length 4305663 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

nlp.max_length = 4305663
kjv_bible_doc = nlp_spacy(kjv_bible)

This crashes the notebook because it runs out of RAM.

Would this work?

np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
np_array.extend(np_array2)
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)
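Not as written: np_array here is a NumPy array, and NumPy arrays have no extend method. If you did want to go the array route, a minimal sketch of the round trip would look like the following, where doc1 and doc2 are hypothetical, already-tokenized chunk docs sharing one vocab:

import numpy as np
from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc

attrs = [LOWER, POS, ENT_TYPE, IS_ALPHA]

# doc1, doc2: hypothetical chunk docs produced earlier.
# Stack the per-chunk attribute arrays; ndarray has no .extend()
merged = np.concatenate([doc1.to_array(attrs), doc2.to_array(attrs)])

# from_array overwrites attributes on existing tokens, so first build
# a Doc containing the combined words, then load the attributes into it
words = [t.text for t in doc1] + [t.text for t in doc2]
combined = Doc(doc1.vocab, words=words)
combined.from_array(attrs, merged)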
You can use the Doc.from_docs function to concatenate multiple Doc objects. Here is the example from the spaCy documentation:

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")  # Doc.from_docs requires spaCy v3+

texts = ["London is the capital of the United Kingdom.",
         "The River Thames flows through London.",
         "The famous Tower Bridge crosses the River Thames."]
docs = list(nlp.pipe(texts))
c_doc = Doc.from_docs(docs)

assert str(c_doc) == " ".join(texts)
assert len(list(c_doc.sents)) == len(docs)
assert [str(ent) for ent in c_doc.ents] == \
       [str(ent) for doc in docs for ent in doc.ents]
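Applied to your problem, you would split the raw text into pieces that stay under nlp.max_length, run them through nlp.pipe, and merge the results. A sketch, assuming spaCy v3 and splitting on whitespace (chunk_text is a hypothetical helper, not a spaCy function):

import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")

def chunk_text(text, chunk_size=900_000):
    """Hypothetical helper: split text on whitespace into pieces
    that stay under nlp.max_length (default 1,000,000 chars)."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # back up to the last space so no word is cut in half
            space = text.rfind(" ", start, end)
            if space > start:
                end = space
        chunks.append(text[start:end])
        start = end + 1  # skip the space between chunks
    return chunks

docs = list(nlp.pipe(chunk_text(kjv_bible)))
kjv_bible_doc = Doc.from_docs(docs)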

If you just increase max_length, it will still crash unless you explicitly disable the components that use a lot of memory (the parser and NER). If you only need tokenization, you can disable everything except the tokenizer when loading the model:

nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])
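With those components off, raising max_length is safe and the whole text can be tokenized in one pass on modest RAM. A minimal sketch, reusing kjv_bible from the question:

import spacy

# Tokenizer-only pipeline: the tagger, parser and NER are what eat
# the memory; the tokenizer itself is cheap
nlp = spacy.load('en', disable=['tagger', 'parser', 'ner'])
nlp.max_length = 4305663  # len(kjv_bible), from the traceback above

kjv_bible_doc = nlp(kjv_bible)
tokens = [token.text for token in kjv_bible_doc]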
