Understanding and using the Stanford NLP coreference resolution tool (in Python 3.7)



I am trying to understand the Stanford NLP coreference tool. Here is my code, which works:

    import os

    # Tell stanza where the CoreNLP distribution lives (set before creating the client)
    os.environ["CORENLP_HOME"] = "/home/daniel/StanfordCoreNLP/stanford-corenlp-4.0.0"

    from stanza.server import CoreNLPClient

    text = 'When he came from Brazil, Daniel was fortified with letters from Conan but otherwise did not know a soul except Herbert. Yet this giant man from the Northeast, who had never worn an overcoat or experienced a change of seasons, did not seem surprised by his past.'

    # 'statistical' here; swap in 'neural' or 'deterministic' for the other runs
    with CoreNLPClient(annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
                       properties={'annotators': 'coref', 'coref.algorithm': 'statistical'},
                       timeout=30000, memory='16G') as client:
        ann = client.annotate(text)

    # Collect the attributes of every mention in every coreference chain
    chains = ann.corefChain
    chain_dict = dict()
    for index_chain, chain in enumerate(chains):
        chain_dict[index_chain] = {}
        chain_dict[index_chain]['ref'] = ''
        chain_dict[index_chain]['mentions'] = [{'mentionID': mention.mentionID,
                                                 'mentionType': mention.mentionType,
                                                 'number': mention.number,
                                                 'gender': mention.gender,
                                                 'animacy': mention.animacy,
                                                 'beginIndex': mention.beginIndex,
                                                 'endIndex': mention.endIndex,
                                                 'headIndex': mention.headIndex,
                                                 'sentenceIndex': mention.sentenceIndex,
                                                 'position': mention.position,
                                                 'ref': '',
                                                 } for mention in chain.mention]

    # Recover each mention's surface text from its token span and print it
    for k, v in chain_dict.items():
        print('key', k)
        mentions = v['mentions']
        for mention in mentions:
            words_list = ann.sentence[mention['sentenceIndex']].token[mention['beginIndex']:mention['endIndex']]
            mention['ref'] = ' '.join(t.word for t in words_list)
            print(mention['ref'])

I tried three algorithms:

  1. Statistical (as shown in the code above). Results:

        he
        this giant man from the Northeast , who had never worn an overcoat or experienced a change of seasons
        Daniel
        his

  2. Neural. Results:

        this giant man from the Northeast , who had never worn an overcoat or experienced a change of seasons ,
        his

  3. Deterministic (I get the error below):

    > Starting server with command: java -Xmx16G -cp
    > /home/daniel/StanfordCoreNLP/stanford-corenlp-4.0.0/*
    > edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout
    > 30000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties
    > corenlp_server-9fedd1e9dfb14c9e.props -preload
    > tokenize,ssplit,pos,lemma,ner,parse,depparse,coref
    > 
    > Traceback (most recent call last):
    > 
    >   File "<ipython-input-58-0f665f07fd4d>", line 1, in <module>
    >     runfile('/home/daniel/Documentos/Working Papers/Leader traits/Code/20200704 - Modeling
    > Organizing/understanding_coreference.py',
    > wdir='/home/daniel/Documentos/Working Papers/Leader
    > traits/Code/20200704 - Modeling Organizing')
    > 
    >   File
    > "/home/daniel/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py",
    > line 827, in runfile
    >     execfile(filename, namespace)
    > 
    >   File
    > "/home/daniel/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py",
    > line 110, in execfile
    >     exec(compile(f.read(), filename, 'exec'), namespace)
    > 
    >   File "/home/daniel/Documentos/Working Papers/Leader
    > traits/Code/20200704 - Modeling
    > Organizing/understanding_coreference.py", line 21, in <module>
    >     ann = client.annotate(text)
    > 
    >   File
    > "/home/daniel/anaconda3/lib/python3.7/site-packages/stanza/server/client.py",
    > line 470, in annotate
    >     r = self._request(text.encode('utf-8'), request_properties, **kwargs)
    > 
    >   File
    > "/home/daniel/anaconda3/lib/python3.7/site-packages/stanza/server/client.py",
    > line 404, in _request
    >     raise AnnotationException(r.text)
    > 
    > AnnotationException: java.lang.RuntimeException:
    > java.lang.IllegalArgumentException: No enum constant
    > edu.stanford.nlp.coref.CorefProperties.CorefAlgorithmType.DETERMINISTIC
    

Questions:

  1. Why do I get this error with the deterministic algorithm?

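  2. Any code that uses Stanford NLP from Python seems to be much slower than comparable code using spaCy or NLTK. I know those libraries do not offer coreference resolution. But, for example, when I use `from nltk.parse.stanford import StanfordDependencyParser` for dependency parsing, it is much faster than this Stanford NLP library. Is there any way to speed up this CoreNLPClient in Python?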

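  3. I will use this library to process long texts. Is it better to work with smaller pieces of the whole text? Can long texts cause wrong coreference results? (I have seen very strange results from this coreference library when using long texts.) Is there an optimal size?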

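  4. Results:

The results from the statistical algorithm seem better. I expected the best results to come from the neural algorithm. Do you agree? The statistical algorithm finds 4 valid mentions, while the neural algorithm finds only 2.

Am I missing something?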

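  1. The error says that `DETERMINISTIC` is not a constant of the `CorefAlgorithmType` enum, so `deterministic` is not an accepted value for `coref.algorithm`. You can find the list of supported algorithms in the Java documentation: link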

  2. You may want to start the server once and then keep reusing it, e.g.

    # Here's the slowest part—models are being loaded
    client = CoreNLPClient(...)
    ann = client.annotate(text)
    ...
    client.stop()
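
To make the reuse concrete, here is a minimal sketch that keeps one server alive for a whole batch of texts, so the model-loading cost is paid once rather than per document. The names `annotate_all` and `run_with_corenlp` are illustrative and not part of stanza; the annotator list, properties, timeout and memory are copied from the question and are assumptions about your setup.

```python
from typing import Any, Iterable, List


def annotate_all(client: Any, texts: Iterable[str]) -> List[Any]:
    """Annotate every text with one already-created client (hypothetical helper)."""
    return [client.annotate(text) for text in texts]


def run_with_corenlp(texts: Iterable[str]) -> List[Any]:
    """Start one CoreNLP server, annotate all texts, then shut the server down."""
    # Imported lazily so annotate_all stays usable without stanza installed.
    from stanza.server import CoreNLPClient

    # Settings below are taken from the question; adjust them to your machine.
    with CoreNLPClient(
        annotators=['tokenize', 'ssplit', 'pos', 'lemma', 'ner', 'parse', 'depparse', 'coref'],
        properties={'coref.algorithm': 'statistical'},
        timeout=30000,
        memory='16G',
    ) as client:
        # Every call reuses the same running server and its loaded models.
        return annotate_all(client, texts)
```

Calling `run_with_corenlp(list_of_texts)` requires `CORENLP_HOME` to be set as in the question. Taking the client as a parameter in `annotate_all` keeps the loop independent of how the server was started.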
    

But I can't give you any clues about 3 and 4.
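If you want to experiment with piece sizes for question 3 yourself, a minimal sketch like the following packs whole sentences into pieces under a character budget, which you can then annotate one by one and compare. It makes no claim about which size is best. `chunk_sentences` and `max_chars` are hypothetical names, and the input is assumed to be a list of sentences that were already split by some sentence splitter.

```python
from typing import List


def chunk_sentences(sentences: List[str], max_chars: int) -> List[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        # One extra character for the joining space when the chunk is non-empty
        extra = len(sentence) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            # Budget exceeded: close the current chunk and start a new one
            chunks.append(' '.join(current))
            current, current_len = [], 0
            extra = len(sentence)
        current.append(sentence)
        current_len += extra
    if current:
        chunks.append(' '.join(current))
    return chunks
```

Keep in mind that each piece is annotated independently, so a chain cannot link a mention in one piece to a mention in another; splitting on paragraph or section boundaries may lose fewer links than splitting at arbitrary points.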
