NER: combining BIO tokens back into the original compound words



Is there any way to combine BIO-tagged tokens back into compound words? I implemented the method below to form words from BIO-tagged tokens, but it does not work for words that contain punctuation. For example, with the following function, S.E.C is joined as S . E . C
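The failure mode is easy to reproduce in isolation: joining sub-tokens with a plain space puts a space around every punctuation token. A minimal illustration (the token list here is just an example of what a tokenizer might emit for "S.E.C"):

```python
tokens = ["S", ".", "E", ".", "C"]

# Joining with a single space works for multi-word entities,
# but tears abbreviations apart:
print(" ".join(tokens))  # S . E . C
```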

import itertools

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([" ".join(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([" ".join(current_entity_tokens), current_entity])

    # Deduplicate
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result

Another approach:

I tried detokenizing with TreebankWordDetokenizer, but it still does not reconstruct the original sentence. For example:

Original sentence -> parties.\n\nIN WITNESS WHEREOF, the parties hereto
Tokenized and detokenized sentence -> parties . IN WITNESS WHEREOF, the parties hereto

Another example:

Original sentence -> Group’s employment, Group shall be
Tokenized and detokenized sentence -> Group ’ s employment, Group shall be

Note that the periods and newlines are not restored when using TreebankWordDetokenizer.

Is there any workaround to form the compound words?

A very small fix should do the job:

import itertools

def join_tokens(tokens):
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # attach punctuation without a space
            else:
                res += ' ' + token  # separate regular words with a space
    return res

def collapse(ner_result):
    # List with the result
    collapsed_result = []

    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        elif current_entity_tokens is not None and tag == "I-" + str(current_entity):
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            collapsed_result.append([join_tokens(current_entity_tokens), current_entity])
            collapsed_result.append([token, tag[2:]])
            current_entity_tokens = []
            current_entity = None

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was an entity at all
    if current_entity is not None:
        collapsed_result.append([join_tokens(current_entity_tokens), current_entity])

    # Deduplicate
    collapsed_result = sorted(collapsed_result)
    collapsed_result = list(k for k, _ in itertools.groupby(collapsed_result))

    return collapsed_result
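As a quick sanity check, the joining rule above (restated here so the snippet runs standalone) inserts a space only between two alphabetic tokens, which keeps abbreviations intact while still separating ordinary words:

```python
def join_tokens(tokens):
    # Same rule as above: a space goes only between two alphabetic tokens.
    res = ''
    if tokens:
        res = tokens[0]
        for token in tokens[1:]:
            if not (token.isalpha() and res[-1].isalpha()):
                res += token  # attach punctuation without a space
            else:
                res += ' ' + token  # separate regular words with a space
    return res

print(join_tokens(["S", ".", "E", ".", "C"]))        # S.E.C
print(join_tokens(["Exchange", "Commission"]))       # Exchange Commission
print(join_tokens(["U", ".", "S", ".", "Securities"]))  # U.S.Securities
```

The last call shows the rule's limitation: any token following punctuation is glued on without a space, which is what motivates tracking word identities in the update.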

Update

This will handle most cases, but as the comments below show, there will always be outliers. The complete solution is therefore to keep track of the identity of the word that produced each token. Thus:

text = "U.S. Securities and Exchange Commission"
lut = [(token, ix) for ix, word in enumerate(text.split()) for token in tokenize(word)]
# lut = [("U",0), (".",0), ("S",0), (".",0), ("Securities",1), ("and",2), ("Exchange",3), ("Commission",4)]

Now, given a token's index, you know exactly which word it came from, and you can simply concatenate tokens that belong to the same word, adding a space only when consecutive tokens belong to different words. The NER result would then look like this:

[["U","B-ORG", 0], [".","I-ORG", 0], ["S", "I-ORG", 0], [".","I-ORG", 0], ['Securities', 'I-ORG', 1], ['and', 'I-ORG', 2], ['Exchange', 'I-ORG',3], ['Commission', 'I-ORG', 4]]
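The word-index idea can be sketched end to end as follows. This is a minimal sketch under two assumptions: the regex-based `tokenize` stands in for whatever tokenizer actually produced the BIO tokens, and `collapse_with_word_ids` is a hypothetical helper name, not part of the original code.

```python
import re

def tokenize(word):
    # Stand-in tokenizer (assumption): splits a word into alphanumeric
    # runs and individual punctuation marks, e.g. "U.S." -> U . S .
    return re.findall(r"\w+|[^\w\s]", word)

def collapse_with_word_ids(tagged):
    """Join entity tokens, inserting a space only when two consecutive
    tokens came from different original words."""
    pieces = []
    prev_ix = None
    for token, _tag, word_ix in tagged:
        if prev_ix is not None and word_ix != prev_ix:
            pieces.append(" ")
        pieces.append(token)
        prev_ix = word_ix
    return "".join(pieces)

text = "U.S. Securities and Exchange Commission"
lut = [(token, ix)
       for ix, word in enumerate(text.split())
       for token in tokenize(word)]

# Attach the word index to each tagged token, as in the NER result above.
ner_result = [(token, "B-ORG" if i == 0 else "I-ORG", ix)
              for i, (token, ix) in enumerate(lut)]

print(collapse_with_word_ids(ner_result))  # U.S. Securities and Exchange Commission
```

Because the space/no-space decision is driven by word identity rather than by whether a token is punctuation, "U.S." stays glued while "Securities" is correctly separated.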

Latest update