How to extract a custom list of entities from a text file

I have a list of entities that looks like this:

["Bluechoice HMO/POS", "Pathway X HMO/PPO", "HMO", "Indemnity/Traditional Health Plan/Standard"]

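This is not an exhaustive list; there are other similar entries.

I want to extract these entities, if present, from a text file containing more than 30 pages of information. The catch is that the text file was generated with OCR, so it may not contain the exact entries. For example, it might contain: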

"Out of all the entries the user made, BIueChoise HMOIPOS is the most prominent"

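Note the misspelling in "BIueChoise HMOIPOS" relative to "Bluechoice HMO/POS".

I want the entities that are present in the text file, even when the corresponding words do not match exactly.

Any help, whether an algorithm or an approach, is welcome. Thanks a lot!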

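You can do this with algorithms that match strings approximately and measure how similar they are, such as Levenshtein distance, Hamming distance, cosine similarity, and many more.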
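To give a feel for what such a measure computes, here is a minimal, dependency-free sketch of Levenshtein distance. The function names `levenshtein` and `similarity` are my own for illustration and are not part of any library; the normalization by the longer string's length is one common convention, assumed here rather than taken from the question.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: scale the distance into a score between 0 and 1."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# The OCR'd string differs from the entity by four substituted characters.
print(levenshtein("Bluechoice HMO/POS", "BIueChoise HMOIPOS"))
print(similarity("Bluechoice HMO/POS", "BIueChoise HMOIPOS"))
```

A score like this can be compared against a threshold to decide whether a candidate substring counts as a match.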

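textdistance is a Python module that provides a range of such algorithms for you to use.

I had a similar problem and solved it with textdistance: I took every substring of the text file whose length equals that of the string I needed to search for and extract, then tried the algorithms to see which one worked for my problem. For me, cosine similarity gave the best results when I kept only the strings with a fuzzy match score above 75%.

Taking "Bluechoice HMO/POS" from your question as an example to give you the idea, I applied it as follows: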

>>> import textdistance
>>>
>>> search_strg = "Bluechoice HMO/POS"
>>> text_file_strg = "Out of all the entries the user made, BIueChoise HMOIPOS is the most prominent"
>>>
>>> extracted_strgs = []
>>> for substr in [text_file_strg[i:i+len(search_strg)] for i in range(0,len(text_file_strg) - len(search_strg)+1)]:
...     if textdistance.cosine(substr, search_strg) > 0.75:
...             extracted_strgs.append(substr)
... 
>>> extracted_strgs
['BIueChoise HMOIPOS']
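The same sliding-window idea can be run over the whole entity list. The sketch below is dependency-free so it can be tried without installing anything: `char_cosine` is my own stand-in that divides the character-count overlap by the geometric mean of the two lengths, which I assume is close to what the cosine measure above does at the character level; it is not textdistance's actual code. `find_entities` and its return shape are likewise hypothetical.

```python
from collections import Counter
from math import sqrt


def char_cosine(a: str, b: str) -> float:
    """Shared character count divided by the geometric mean of the lengths."""
    if not a or not b:
        return 0.0
    overlap = sum((Counter(a) & Counter(b)).values())
    return overlap / sqrt(len(a) * len(b))


def find_entities(text, entities, threshold=0.75):
    """Return {entity: [(start, substring, score), ...]} for every window
    of the entity's length whose score exceeds the threshold."""
    found = {}
    for entity in entities:
        n = len(entity)
        hits = []
        for i in range(len(text) - n + 1):
            window = text[i:i + n]
            score = char_cosine(window, entity)
            if score > threshold:
                hits.append((i, window, score))
        if hits:
            found[entity] = hits
    return found


entities = ["Bluechoice HMO/POS", "Pathway X HMO/PPO", "HMO",
            "Indemnity/Traditional Health Plan/Standard"]
text = "Out of all the entries the user made, BIueChoise HMOIPOS is the most prominent"

matches = find_entities(text, entities)
for entity, hits in matches.items():
    print(entity, hits)
```

Note that the short entity "HMO" is also reported here, because it occurs inside the longer OCR'd name. If that is unwanted, one option is to process entities from longest to shortest and skip windows that overlap an earlier hit. For a 30-page file, scanning every window for every entity may be slow, so the threshold and window strategy are worth tuning on real data.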
