Extracting text data between keywords in a string



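The text data below has been extracted from a file and cleaned. I want to put it into a pandas DataFrame whose columns are ('EXAMINATION', 'TECHNIQUE', 'COMPARISON', 'FINDINGS', 'IMPRESSION'), with each cell in a row holding the extracted text that belongs to its column name (i.e. the keyword).

FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process

For example, the cell under the TECHNIQUE column should contain "Chest PA and lateral", and the cell under the IMPRESSION column should contain "No acute cardiopulmonary process".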

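Here is a solution. Note that it makes the following assumptions:

  1. The keywords appear in the sample text in this order.
  2. The text to be extracted does not itself contain any of the keywords.
  3. Each keyword is followed by ": " (the colon and the space are removed from the extracted text).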
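The three assumptions above can be checked before extraction. The sketch below is only an illustration: `check_keyword_assumptions` is a hypothetical helper name, not part of the answer, and the shortened `report` string is made up for the example.

```python
def check_keyword_assumptions(text, keywords):
    """Return a list of problems; an empty list means all three assumptions hold."""
    problems = []
    positions = []
    for kw in keywords:
        # Assumption 2: the keyword occurs exactly once, so it is not part of the body text
        if text.count(kw) != 1:
            problems.append(f"{kw!r} occurs {text.count(kw)} times, expected 1")
            continue
        pos = text.index(kw)
        # Assumption 3: the keyword is immediately followed by ": "
        if not text.startswith(": ", pos + len(kw)):
            problems.append(f"{kw!r} is not followed by ': '")
        positions.append(pos)
    # Assumption 1: the keywords appear in the given order
    if positions != sorted(positions):
        problems.append("keywords are not in the expected order")
    return problems

# Shortened, illustrative report text
report = "EXAMINATION: CHEST PA AND LAT TECHNIQUE: Chest PA and lateral IMPRESSION: No acute cardiopulmonary process"
print(check_keyword_assumptions(report, ["EXAMINATION", "TECHNIQUE", "IMPRESSION"]))
print(check_keyword_assumptions(report, ["TECHNIQUE", "EXAMINATION"]))
```

If the returned list is not empty, the index-based extraction may raise a ValueError or return misaligned text, so such a report is probably best handled separately.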

Solution
import pandas as pd
sample = "FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process"
keywords = ["EXAMINATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]

# Create function to extract text between each of the keywords
def extract_text_using_keywords(clean_text, keyword_list):
    extracted_texts = []
    for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
        prev_kw_index = clean_text.index(prev_kw)
        current_kw_index = clean_text.index(current_kw)
        # Skip the keyword plus the 2 characters of ": " that follow it
        extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
        # Extract the text after the final keyword in keyword_list (i.e. "IMPRESSION")
        if current_kw == keyword_list[-1]:
            extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
    return extracted_texts

# Extract text
result = extract_text_using_keywords(sample, keywords)
# Create pandas dataframe
df = pd.DataFrame([result], columns=keywords)
print(df)
# To append future results to the end of the pandas df you can use
# df.loc[len(df)] = result

Output
                                         EXAMINATION             TECHNIQUE  COMPARISON                                           FINDINGS                        IMPRESSION
0  CHEST PA AND LAT INDICATION: F with new onset ...  Chest PA and lateral        None  There is no focal consolidation pleural effusi...  No acute cardiopulmonary process
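In the output above, the EXAMINATION cell also contains the INDICATION text, because INDICATION is not in the keyword list and so everything up to TECHNIQUE is kept. If that is not wanted, one option is to add INDICATION as a keyword. The sketch below is a regex-based variant rather than the index-based function above. `extract_sections` is a hypothetical name, and the shortened sample leaves out FINDINGS on purpose to show that a missing section becomes None instead of raising an error.

```python
import re

def extract_sections(text, keywords):
    """Map each keyword to the text that follows it, or None if the keyword is absent."""
    # Alternation such as (EXAMINATION|INDICATION|...) followed by a colon and optional spaces
    pattern = r"\b(" + "|".join(map(re.escape, keywords)) + r"):\s*"
    # With one capturing group, re.split returns [preamble, kw1, text1, kw2, text2, ...]
    parts = re.split(pattern, text)
    found = dict(zip(parts[1::2], (p.strip() for p in parts[2::2])))
    return {kw: found.get(kw) for kw in keywords}

# Shortened, illustrative report text without a FINDINGS section
sample = "FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None IMPRESSION: No acute cardiopulmonary process"
keywords = ["EXAMINATION", "INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
sections = extract_sections(sample, keywords)
print(sections)
```

The resulting dict can be passed to pandas as pd.DataFrame([sections], columns=keywords). If INDICATION should not appear as a column, list only the five original keywords in `columns`. Unlike the index-based function, this variant does not depend on the order of the keywords.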

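It looks like the input always presents the sections in the same order: EXAMINATION, TECHNIQUE, and so on.

One approach is to iterate over consecutive pairs of keywords and use .split() to select the content between them. Here is one way to do it: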

import pandas as pd
data = 'FINAL REPORT EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection TECHNIQUE: Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process'
# The trailing empty string marks "no next keyword" for the last section
strings = ('EXAMINATION','TECHNIQUE', 'COMPARISON','FINDINGS', 'IMPRESSION', '')
out = {}
for s1, s2 in zip(strings, strings[1:]):
    if not s2:
        # Last keyword: keep everything after it
        text = data.split(s1)[1]
    else:
        # Keep what lies between the current keyword and the next one
        text = data.split(s1)[1].split(s2)[0]
    out[s1] = [text]
print(pd.DataFrame(out))

The result is:

                                         EXAMINATION               TECHNIQUE  COMPARISON                                           FINDINGS                          IMPRESSION
0  : CHEST PA AND LAT INDICATION: F with new onse...  : Chest PA and lateral      : None  : There is no focal consolidation pleural effu...  : No acute cardiopulmonary process
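Each cell in this result keeps a leading ": " because the string is split on the bare keyword. A minimal sketch of one way to drop it, assuming every keyword is followed by ": ", is to split on the keyword plus ": " and strip the surrounding whitespace. `between` is a hypothetical helper name and the shortened `data` string is only for illustration.

```python
def between(text, start, end=None):
    """Return the stripped text after `start + ": "` and before `end` (or to the end of the text)."""
    # partition splits at the first occurrence only; rest is "" if start is absent
    _, _, rest = text.partition(start + ": ")
    if end:
        rest, _, _ = rest.partition(end)
    return rest.strip()

# Shortened, illustrative report text
data = "EXAMINATION: CHEST PA AND LAT TECHNIQUE: Chest PA and lateral IMPRESSION: No acute cardiopulmonary process"
names = ["EXAMINATION", "TECHNIQUE", "IMPRESSION"]
# Pair each keyword with the next one; the last keyword is paired with None
out = {s1: between(data, s1, s2) for s1, s2 in zip(names, names[1:] + [None])}
print(out)
```

Wrapping the dict in a list, as in pd.DataFrame([out]), gives the same one-row layout as above without the ": " prefixes.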