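I am using Amazon Textract to analyze anonymized blood tests. Each report consists of markers, their values, units, and reference intervals.
I want to extract them into a dictionary like this: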
{"globulin": [2.8, gidL, [1.0, 4.0]], "cholesterol": [161, mg/dL, [120, 240]], .... }
Here is an example of the text this kind of OCR produces:
Name:
Date Perfermed
$/6/2010
DOBESevState:
Date Collected:
05/03/201004.00 PN
Date Lac Meat: 05/03/2010 10.45 A
Eraminer:
PTM
Date Received: $/7/2010 12:13.11A
Tukit No.
8028522035
Abeormal
Normal
Range
CARDLAC RISK
CHOLESTEROL
161.00
120.00 240.00 mg/dL
CHOLESTEROLHDL RATIO
2.39
1.250 5.00
HIGH DENSITY LIPOPROTEINCHDL)
67.30
35.00 75.00 me/dL
LOW DENSITY LIPOPROTEIN (LDL)
78.70
60.00 a 190.00 midI.
TRIGLYCERIDES
75.00
10.00 a 200.00 made
CHEMISTRIES
ALBUMIN
4.40
3.50 5.50 pidl
ALKALINE PHOSPHATASE
49.00
30.00 120.00 UAL
BLOOD UREA NITROGEN (BUN)
17.00
6.00 2500 meidL
CREATININE
0,85
060 1.50 matdL
FRUCTOSAMINE
182
1.20 1.79 mmoV/l
GAMMA GLUTAMYUTRANSFERASE
9.00
2.00 65.00 UIL
GLOBULIN
2.80
1.00 4.00 gidL.
GLUCOSE
61.00
70.00 125.00 me/dl.
HEMOGLOBIN AIC
5.10
3.00 6.00 %
SGOT (AST)
25.00
0.00 41.00 UM
SOPI (ALT)
22.00
0.00 45.00 IMI
TOTAL BILIRUBIN
0.52
0.10 1.20 mmeldi.
TOTAL PROTEIN
720
6.00 8.50 gidl.
1. This sample lab report shows both normal and abnormal results. as well as
acceptable reference ranges for each testing category.
Please advise on the best way to extract this information. I have already tried Amazon Comprehend Medical; it does the job, but not for every image. I have also tried spaCy: https://github.com/NLPatVCU/medaCy, https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da
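This is probably not a good application for NLP, because the text is not natural language of any kind. Rather, it is structured data that can be extracted with rules. Writing rules is definitely one way to accomplish this.
- You can first try fuzzy matching the category names, i.e. "CARDIAC RISK" and "CHEMISTRIES", against the OCR result, to divide the text into its categories.
- If you are sure that every entry takes exactly 3 lines, you can simply partition the lines by newline and extract the data from there.
- Once they are split into entries, each entry can be parsed into its name, value, range, and unit.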
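As one illustration of what such a rule might look like, the third line of an entry (reference range followed by an optional unit) can be described with a regular expression. This is only a minimal sketch using the standard library; the pattern and the names `NUMBER`, `RANGE_LINE`, and `parse_range_line` are assumptions made for illustration, and the pattern is tuned to the sample above rather than to lab reports in general.

```python
import re

# A number as OCR tends to emit it here: digits with an optional
# "." or "," decimal part (hypothetical pattern, fitted to the sample).
NUMBER = r"\d+(?:[.,]\d+)?"

# low bound, an optional stray non-numeric token (e.g. the "a" in
# "60.00 a 190.00 midI."), high bound, then whatever remains as the unit.
RANGE_LINE = re.compile(
    rf"^\s*({NUMBER})\s+(?:[^\d\s]\S*\s+)?({NUMBER})\s*(.*?)\s*$"
)


def parse_range_line(line):
    """Return ([low, high], unit) for the third line of an entry."""
    m = RANGE_LINE.match(line)
    if m is None:
        raise ValueError(f"Not a range line: {line!r}")
    low, high, unit = m.groups()
    # Treat a decimal comma as a decimal point before converting.
    bounds = [float(low.replace(",", ".")), float(high.replace(",", "."))]
    return bounds, unit


print(parse_range_line("120.00 240.00 mg/dL"))
print(parse_range_line("60.00 a 190.00 midI."))
print(parse_range_line("1.250 5.00"))
```

A line that does not fit the pattern raises `ValueError`, which makes it easy to collect the entries that need manual review instead of silently mis-parsing them.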
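Here is some sample code that I ran against the data you provided. It requires the fuzzyset package, which can be installed by running python3 -m pip install fuzzyset. Because some entries have no unit, I modified your desired output format slightly and made the unit a list, so that it can easily be empty. That list also stores the stray letters found on the third line.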
from fuzzyset import FuzzySet

### Load data
with open("ocr_result.txt") as f:
    data = f.read()
lines = data.split("\n")

### Create fuzzy set
CATEGORIES = ("CARDIAC RISK", "chemistries")
fs = FuzzySet(lines)

### Get the line ranges of each category
cat_ranges = [0] * (len(CATEGORIES) + 1)
for i, cat in enumerate(CATEGORIES):
    # fs.get returns (score, matched_line) pairs; take the best match
    match = fs.get(cat)[0]
    match_idx = lines.index(match[1])
    cat_ranges[i] = match_idx
# The footnote after the last entry marks the end of the last category
last_idx = lines.index(fs.get("sample lab report")[0][1])
cat_ranges[-1] = last_idx

### Read lines in each category
def _to_float(s: str) -> float:
    """
    Attempt to convert a string value to float
    """
    try:
        f = float(s)
    except ValueError:
        # OCR sometimes reads the decimal point as a comma, e.g. "0,85"
        if "," in s:
            s = s.replace(",", ".")
            f = float(s)
        else:
            raise ValueError(f"Cannot convert {s} to float.")
    return f

result = {}
for i, cat in enumerate(CATEGORIES):
    result[cat] = {}
    # Ignore the line of the category itself
    s = slice(cat_ranges[i] + 1, cat_ranges[i + 1])
    lines_in_cat = lines[s]
    if len(lines_in_cat) % 3 != 0:
        raise ValueError(
            f"Expected 3 lines per entry in {cat!r}, got {len(lines_in_cat)} lines"
        )
    for j in range(0, len(lines_in_cat), 3):
        _name = lines_in_cat[j]
        _value = lines_in_cat[j + 1]
        # split() without arguments never yields empty tokens
        _line_3 = lines_in_cat[j + 2].split()
        # Convert value to float
        _value = _to_float(_value)
        # Process line 3 to get range and unit
        _range = []
        _unit = []
        for v in _line_3:
            # The first two numeric tokens are the range; the rest is unit text
            if v[0].isdigit() and len(_range) < 2:
                _range.append(_to_float(v))
            else:
                _unit.append(v)
        _l = [_value, _unit, _range]
        result[cat][_name] = _l
print(result)
Output:
{'CARDIAC RISK': {'CHOLESTEROL': [161.0, ['mg/dL'], [120.0, 240.0]], 'CHOLESTEROLHDL RATIO': [2.39, [], [1.25, 5.0]], 'HIGH DENSITY LIPOPROTEINCHDL)': [67.3, ['me/dL'], [35.0, 75.0]], 'LOW DENSITY LIPOPROTEIN (LDL)': [78.7, ['a', 'midI.'], [60.0, 190.0]], 'TRIGLYCERIDES': [75.0, ['a', 'made'], [10.0, 200.0]]}, 'chemistries': {'ALBUMIN': [4.4, ['pidl'], [3.5, 5.5]], 'ALKALINE PHOSPHATASE': [49.0, ['UAL'], [30.0, 120.0]], 'BLOOD UREA NITROGEN (BUN)': [17.0, ['meidL'], [6.0, 2500.0]], 'CREATININE': [0.85, ['matdL'], [60.0, 1.5]], 'FRUCTOSAMINE': [182.0, ['mmoV/l'], [1.2, 1.79]], 'GAMMA GLUTAMYUTRANSFERASE': [9.0, ['UIL'], [2.0, 65.0]], 'GLOBULIN': [2.8, ['gidL.'], [1.0, 4.0]], 'GLUCOSE': [61.0, ['me/dl.'], [70.0, 125.0]], 'HEMOGLOBIN AIC': [5.1, ['%'], [3.0, 6.0]], 'SGOT (AST)': [25.0, ['UM'], [0.0, 41.0]], 'SOPI (ALT)': [22.0, ['IMI'], [0.0, 45.0]], 'TOTAL BILIRUBIN': [0.52, ['mmeldi.'], [0.1, 1.2]], 'TOTAL PROTEIN': [720.0, ['gidl.'], [6.0, 8.5]]}}
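If the flat shape from the question is preferred, the nested result can be collapsed in a small post-processing step. The sketch below is an assumption about how that might be done, not part of the code that produced the output above: `flatten` and `suspicious` are hypothetical helper names, unit tokens are simply joined with spaces (so stray letters are kept), and the only sanity check is for an incomplete or inverted range, which in this sample points at a dropped decimal point.

```python
def flatten(result):
    """Merge per-category dicts into {name: [value, unit, [low, high]]}."""
    flat = {}
    for entries in result.values():
        for name, (value, unit_tokens, ref_range) in entries.items():
            # Join the unit tokens into one string; empty list -> "".
            flat[name.lower()] = [value, " ".join(unit_tokens), ref_range]
    return flat


def suspicious(flat):
    """Names whose reference range is incomplete or inverted."""
    return [
        name
        for name, (_value, _unit, ref_range) in flat.items()
        if len(ref_range) != 2 or ref_range[0] > ref_range[1]
    ]


# Two entries copied from the output above.
sample = {
    "chemistries": {
        "CREATININE": [0.85, ["matdL"], [60.0, 1.5]],
        "GLOBULIN": [2.8, ["gidL."], [1.0, 4.0]],
    }
}
flat = flatten(sample)
print(flat["globulin"])
print(suspicious(flat))
```

A value that falls outside its range is deliberately not flagged, since that can be a genuine abnormal result rather than an OCR error.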