'nltk' CoreNLPParser:防止在POS标记器中的连字符处拆分



我正在使用带有斯坦福NLP服务器的nltkCoreNLPParser进行POS标记,如本答案中所述。

此标记器将带有连字符的单词视为多个单词,例如,像2007-08这样的日期被标记为CP, :, CP。但是,我的模型使用带有连字符的单词作为标记。是否可以使用CoreNLPParser来防止连字符处拆分?

TL;博士

from nltk.parse.corenlp import GenericCoreNLPParser
class CoreNLPParser(GenericCoreNLPParser):
_OUTPUT_FORMAT = 'penn'
parser_annotator = 'parse'
def make_tree(self, result):
return Tree.fromstring(result['parse'])
def tag_sents(self, sentences, properties=None):
"""
Tag multiple sentences.
Takes multiple sentences as a list where each sentence is a list of
tokens.
:param sentences: Input sentences to tag
:type sentences: list(list(str))
:rtype: list(list(tuple(str, str))
"""
# Converting list(list(str)) -> list(str)
sentences = (' '.join(words) for words in sentences)
if properties == None:
properties = {'tokenize.whitespace':'true'}
return [sentences[0] for sentences in self.raw_tag_sents(sentences, properties)]
def tag(self, sentence, properties=None):
"""
Tag a list of tokens.
:rtype: list(tuple(str, str))
>>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
>>> tokens = 'Rami Eid is studying at Stony Brook University in NY'.split()
>>> parser.tag(tokens)
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'),
('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'O')]
>>> parser = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
>>> tokens = "What is the airspeed of an unladen swallow ?".split()
>>> parser.tag(tokens)
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'),
('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'),
('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
"""
return self.tag_sents([sentence], properties)[0]
def raw_tag_sents(self, sentences, properties=None):
"""
Tag multiple sentences.
Takes multiple sentences as a list where each sentence is a string.
:param sentences: Input sentences to tag
:type sentences: list(str)
:rtype: list(list(list(tuple(str, str)))
"""
default_properties = {'ssplit.isOneSentence': 'true',
'annotators': 'tokenize,ssplit,' }
default_properties.update(properties or {})
# Supports only 'pos' or 'ner' tags.
assert self.tagtype in ['pos', 'ner']
default_properties['annotators'] += self.tagtype
for sentence in sentences:
tagged_data = self.api_call(sentence, properties=default_properties)
yield [[(token['word'], token[self.tagtype]) for token in tagged_sentence['tokens']]
for tagged_sentence in tagged_data['sentences']]
pos_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='pos')
sent = ['My', 'birthday', 'is', 'on', '09-12-2050']
print(pos_tagger.tag(sent))

[输出]:

[('My', 'PRP$'), ('birthday', 'NN'), ('is', 'VBZ'), ('on', 'IN'), ('09-12-2050', 'CD')]

在长

  • 为什么 CoreNLP ner 标记器和 ner 标记器将分隔的数字连接在一起?
  • https://github.com/nltk/nltk/issues/2112

最新更新