NER: combining BIO tokens back into the original compound words



Is there any way to combine BIO-tagged tokens back into compound words? I implemented the method below to form words from BIO-tagged tokens, but it does not work for words that contain punctuation. For example, with the following function, S.E.C is joined as S . E . C
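The failure mode is easy to reproduce in isolation: joining sub-tokens with a plain space puts a space around every punctuation token. A minimal illustration (the token list here is just an example of what a tokenizer might emit for "S.E.C"):

```python
tokens = ["S", ".", "E", ".", "C"]

# Joining with a single space works for multi-word entities,
# but tears abbreviations apart:
print(" ".join(tokens))  # S . E . C
```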

import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])

    # Deduplicate
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result

Another approach:

I tried detokenizing with TreebankWordDetokenizer, but it still does not reconstruct the original sentence. For example:

Original sentence -> parties.\n\nIN WITNESS WHEREOF, the parties hereto
Tokenized and detokenized sentence -> parties . IN WITNESS WHEREOF, the parties hereto

Another example:

Original sentence -> Group’s employment, Group shall be
Tokenized and detokenized sentence -> Group ’ s employment, Group shall be

Note that the periods and newlines are not restored when using TreebankWordDetokenizer.

Is there any workaround to form the compound words?

A very small fix should do the job:

import itertools

def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # attach punctuation without a space
            else:
                res += ' ' + token  # separate regular words with a space
    return res

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

    # Deduplicate
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result
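As a quick sanity check, the joining rule above (restated here so the snippet runs standalone) inserts a space only between two alphabetic tokens, which keeps abbreviations intact while still separating ordinary words:

```python
def join_tokens(tokens):
    # Same rule as above: a space goes only between two alphabetic tokens.
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # attach punctuation without a space
            else:
                res += ' ' + token  # separate regular words with a space
    return res

print(join_tokens(["S", ".", "E", ".", "C"]))        # S.E.C
print(join_tokens(["Exchange", "Commission"]))       # Exchange Commission
print(join_tokens(["U", ".", "S", ".", "Securities"]))  # U.S.Securities
```

The last call shows the rule's limitation: any token following punctuation is glued on without a space, which is what motivates tracking word identities in the update.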

Update

This will handle most cases, but as the comments below show, there will always be outliers. The complete solution is therefore to keep track of the identity of the word that produced each token. Thus:

text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U",0), (".",0), ("S",0), (".",0), ("Securities",1), ("and",2), ("Exchange",3), ("Commission",4)]

Now, given a token's index, you know exactly which word it came from, and you can simply concatenate tokens that belong to the same word, adding a space only when consecutive tokens belong to different words. The NER result would then look like this:

[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
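The word-index idea can be sketched end to end as follows. This is a minimal sketch under two assumptions: the regex-based `tokenize` stands in for whatever tokenizer actually produced the BIO tokens, and `collapse_with_word_ids` is a hypothetical helper name, not part of the original code.

```python
import re

def tokenize(word):
    # Stand-in tokenizer (assumption): splits a word into alphanumeric
    # runs and individual punctuation marks, e.g. "U.S." -> U . S .
    return re.findall(r"\w+|[^\w\s]", word)

def collapse_with_word_ids(tagged):
    """Join entity tokens, inserting a space only when two consecutive
    tokens came from different original words."""
    pieces = []
    prev_ix = None
    for token, _tag, word_ix in tagged:
        if prev_ix is not None and word_ix != prev_ix:
            pieces.append(" ")
        pieces.append(token)
        prev_ix = word_ix
    return "".join(pieces)

text = "U.S. Securities and Exchange Commission"
lut = [(token, ix)
       for ix, word in enumerate(text.split())
       for token in tokenize(word)]

# Attach the word index to each tagged token, as in the NER result above.
ner_result = [(token, "B-ORG" if i == 0 else "I-ORG", ix)
              for i, (token, ix) in enumerate(lut)]

print(collapse_with_word_ids(ner_result))  # U.S. Securities and Exchange Commission
```

Because the space/no-space decision is driven by word identity rather than by whether a token is punctuation, "U.S." stays glued while "Securities" is correctly separated.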

Latest update