How to label named entities as training data for custom named entity recognition with spaCy



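I want to train a named entity recognizer on my custom dataset. I have prepared a Python dictionary where key = entity_type and values = a list of entity names, but I can't find a way to tag the tokens in the proper format.

I have tried plain string matching (find) and regular expressions (search, compile), but I did not get what I wanted.

For example, my sentence and the dict I'm using are (this is a sample):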

sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
 'DM': ['data mining']}
text = sentence.lower()  # lowercase once so the lowercase dictionary values can match "Machine"
for k,v in dic.items():
  for val in v:
    if val in text:
      # right now I'm just printing the key, val and starting index
      print(k, val, text.index(val))
Output:
MLDM machine learning and data mining 0
ML machine learning 0
DM data mining 21
Expected output: MLDM 0 32
so that I can go on to prepare training data for spaCy NER:
[{"content":"machine learning and data mining often employ the same methods and overlap significantly.","entities":[[0,32,"MLDM"]]}]

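You can build a regex from all the values in dic to match them as whole words and, on each match, grab the key associated with the matched value. I assume the value items are unique in the dictionary, that they can contain whitespace, and that they contain only "word" characters (no special characters such as + or parentheses).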

import re
sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
 'DM': ['data mining']}
def get_key(val):
    for k,v in dic.items():
        if val.lower() in map(str.lower, v):  # compare the argument, not the global match object
            return k
    return ''
# Flatten the lists in values and sort the list by length in descending order
l=sorted([v for x in dic.values() for v in x], key=len, reverse=True)
# Build the alternation based regex with \b to match each item as a whole word
rx=r'\b(?:{})\b'.format("|".join(l))
for m in re.finditer(rx, sentence, re.I): # Search case insensitively
    key = get_key(m.group())
    if key:
        print("{} {}".format(key, m.start()))

See the Python demo.
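To get from the matches to the record format shown in the question, the match's end offset can be collected along with its start. The following is a minimal sketch, not a definitive implementation. The helper name `build_training_example` is hypothetical, the `{"content", "entities"}` layout is copied from the question, and `re.escape` is an added safeguard that the regex above does not use. It assumes the dictionary is non-empty and each phrase belongs to exactly one label.

```python
import re

def build_training_example(sentence, dic):
    """Return one {"content", "entities"} record with [start, end, label] spans."""
    # Map each lowercased phrase to its label (assumes phrases are unique across labels).
    label_of = {phrase.lower(): label
                for label, phrases in dic.items()
                for phrase in phrases}
    # Longest phrases first, so the alternation prefers the longest match at a position.
    phrases = sorted(label_of, key=len, reverse=True)
    # re.escape guards against phrases that contain regex metacharacters.
    rx = re.compile(r'\b(?:{})\b'.format("|".join(map(re.escape, phrases))), re.I)
    # finditer yields non-overlapping matches; end offsets are exclusive.
    entities = [[m.start(), m.end(), label_of[m.group().lower()]]
                for m in rx.finditer(sentence)]
    return {"content": sentence, "entities": entities}

sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
       'DM': ['data mining']}
train_data = [build_training_example(sentence, dic)]
print(train_data[0]["entities"])
```

For the sample sentence this prints `[[0, 32, 'MLDM']]`. The shorter `ML` and `DM` phrases are not reported separately because the longer alternative is tried first and matches are non-overlapping.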
