Extracting text data from files in different subdirectories raises "ValueError: substring not found"



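I am trying to extract text data from files in different subdirectories and put the extracted data into a pandas DataFrame.

An example of the text data is given below:

"EXAMINATION: CHEST PA AND LATERAL INDICATION: History: F with shortness of breath TECHNIQUE: Chest PA and lateral COMPARISON: FINDINGS: The cardiac, mediastinal and hilar contours are normal. The pulmonary vasculature is normal. The lungs are clear. No pleural effusion or pneumothorax. Multiple clips are again seen over the left breast. Remote left-sided rib fractures are also again demonstrated. IMPRESSION: No acute cardiopulmonary abnormality."

However, when I try to execute the code given below, it produces the following error. How can I resolve this?

Error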
ValueError                                Traceback (most recent call last)
<ipython-input-108-bbeeb452bdef> in <module>
48         df = pd.DataFrame(columns=keywords)
49         # Extract text
---> 50         result = extract_text_using_keywords(text, keywords)
51         # Append list of extracted text to the end of the pandas df
52         df.loc[len(df)] = result
<ipython-input-108-bbeeb452bdef> in extract_text_using_keywords(clean_text, keyword_list)
39             for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
40                 prev_kw_index = clean_text.index(prev_kw)
---> 41                 current_kw_index = clean_text.index(current_kw)
42                 extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
43                 if current_kw == keyword_list[-1]:
ValueError: substring not found

Code
out = []
result = {}
for filename in glob.iglob('/content/sample_data/**/*.txt', recursive = True):

    out.append(filename)
print('File names: ',out)
for file in out:

    with open(file) as f:
        data = f.read()


        import re
        text = re.sub(r"[-_()\n\"#//@;<>{}=~|?,]*", "", data)
        text = re.sub(r'FINAL REPORT', '', text)
        text = re.sub(r'\s+', ' ', text)
        print(text)
        keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
        # Create function to extract text between each of the keywords
        # Assumption
        def extract_text_using_keywords(clean_text, keyword_list):
            extracted_texts = []
            for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
                prev_kw_index = clean_text.index(prev_kw)
                current_kw_index = clean_text.index(current_kw)
                extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
                if current_kw == keyword_list[-1]:
                    extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
            return extracted_texts
        # Create empty pandas df with keywords as column names
        df = pd.DataFrame(columns=keywords)
        # Extract text
        result = extract_text_using_keywords(text, keywords)
        # Append list of extracted text to the end of the pandas df
        df.loc[len(df)] = result
        #print(df)
        with pd.option_context('display.max_colwidth', None): # For displaying full columns
            display(df)

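The ValueError is raised by the index() call on the line current_kw_index = clean_text.index(current_kw), because clean_text does not contain the current_kw the code is looking for.

Most likely, in at least one of your files, the data (and therefore the text you pass to result = extract_text_using_keywords(text, keywords)) is missing one of "INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS" or "IMPRESSION". The simplest fix is therefore to find the file that causes the problem and add the missing keyword.

To make debugging easier, you can update the extract_text_using_keywords() function to include try/except blocks that give more useful output for the ValueError. You can also update other parts of the code to handle the follow-on problems caused by a keyword not being found. A complete solution is given below: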
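As a small illustration of that behaviour (the sample string here is made up for the example), str.index() raises when the substring is absent, whereas str.find() returns -1 for the same lookup:

```python
text = "INDICATION: cough COMPARISON: None"

# str.find signals a missing substring with -1 ...
position = text.find("TECHNIQUE")
print(position)

# ... while str.index raises ValueError for the same lookup.
try:
    text.index("TECHNIQUE")
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

This prints -1 followed by the same "substring not found" message that appears in the traceback.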
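One way to locate the offending file is a quick membership check before extraction. The sketch below is only an illustration: missing_keywords is a hypothetical helper name, and the sample string is a shortened report without a TECHNIQUE heading.

```python
def missing_keywords(text, keyword_list):
    """Return the keywords that do not occur anywhere in text."""
    return [kw for kw in keyword_list if kw not in text]


keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]

# Shortened sample report in which the TECHNIQUE heading is absent.
sample = (
    "EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites "
    ": Chest PA and lateral COMPARISON: None FINDINGS: No effusion "
    "IMPRESSION: No acute cardiopulmonary process"
)

print(missing_keywords(sample, keywords))
```

Calling the helper on the cleaned text of each file in out, and printing the file name whenever the returned list is non-empty, should point directly at the files that need attention.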

import glob
import pandas as pd
import re
# Get & print all .txt file names with directory information
out = []
for filename in glob.iglob('content/sample_data/**/*.txt', recursive = True):
    out.append(filename)
print('File names: ', out)
# Define keywords
keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
# Create empty pandas df with keywords as column names
df = pd.DataFrame(columns=keywords)

# Create function to extract text between each of the keywords
def extract_text_using_keywords(clean_text, keyword_list):
    extracted_texts = []
    for prev_kw, current_kw in zip(keyword_list, keyword_list[1:]):
        try:
            prev_kw_index = clean_text.index(prev_kw)
        except ValueError:
            print("Keyword {} was not found in the text.".format(prev_kw))
        try:
            current_kw_index = clean_text.index(current_kw)
        except ValueError:
            print("Keyword {} was not found in the text.".format(current_kw))
        try:
            extracted_texts.append(clean_text[prev_kw_index + len(prev_kw) + 2:current_kw_index])
            if current_kw == keyword_list[-1]:
                extracted_texts.append(clean_text[current_kw_index + len(current_kw) + 2:len(clean_text)])
        except UnboundLocalError:
            print("An index was not assigned for a particular keyword.")
    return extracted_texts

# Iterate over all .txt files
for file in out:
    with open(file) as f:
        data = f.read()
    text = re.sub(r"[-_()\n\"#//@;<>{}=~|?,]*", "", data)
    text = re.sub(r'FINAL REPORT', '', text)
    text = re.sub(r'\s+', ' ', text)
    # print(text)
    # Extract text
    result = extract_text_using_keywords(text, keywords)
    # If all keywords and their results were found
    if len(result) == len(keywords):
        # Append list of extracted text to the end of the pandas df
        df.loc[len(df)] = result
    else:
        print("\nFailed to extract text for one or more keywords.\nPlease check that {} are all present in the following text:\n\n{}\n".format(keywords, text))
# Display results
print(df)
# with pd.option_context('display.max_colwidth', None): # For displaying full columns
#     display(df)

When a keyword (e.g. "TECHNIQUE") is missing, this produces the following error output:

Keyword TECHNIQUE was not found in the text.
An index was not assigned for a particular keyword.
Keyword TECHNIQUE was not found in the text.
Failed to extract text for one or more keywords.
Please check that ['EXAMINATION', 'TECHNIQUE', 'COMPARISON', 'FINDINGS', 'IMPRESSION'] are all present in the following text:
EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites eval for infection : Chest PA and lateral COMPARISON: None FINDINGS: There is no focal consolidation pleural effusion or pneumothorax Bilateral nodular opacities that most likely represent nipple shadows The cardiomediastinal silhouette is normal Clips project over the left lung potentially within the breast The imaged upper abdomen is unremarkable Chronic deformity of the posterior left sixth and seventh ribs are noted IMPRESSION: No acute cardiopulmonary process
Empty DataFrame
Columns: [INDICATION, TECHNIQUE, COMPARISON, FINDINGS, IMPRESSION]
Index: []

And it produces the desired output when all keywords are present:

File names:  ['content/sample_data\my_data.txt', 'content/sample_data\my_data2.txt']
INDICATION              TECHNIQUE COMPARISON                                           FINDINGS                        IMPRESSION
0  F with new onset ascites eval for infection   Chest PA and lateral       None   There is no focal consolidation pleural effusi...  No acute cardiopulmonary process
1   Chronic pain noted in lower erector spinae                Palpate       None   Upper iliocostalis thoracis triggers pain alon...                               Nil
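Note that in the try/except version a keyword that is missing later in the list can leave an index from an earlier loop iteration in place, so a wrong slice may be appended without any UnboundLocalError. As a possible alternative, the sketch below locates only the headings that are actually present and leaves the rest as None. It is not part of the code above: extract_sections is a hypothetical name, and it assumes every heading is written as the keyword followed by a colon.

```python
import re


def extract_sections(clean_text, keyword_list):
    """Map each keyword to the text that follows it, or None if it is absent."""
    # Match "KEYWORD:" plus any following whitespace, for listed keywords only.
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, keyword_list)) + r"):\s*"
    )
    matches = list(pattern.finditer(clean_text))
    sections = dict.fromkeys(keyword_list)  # every keyword starts as None
    # Each section runs from the end of its heading to the next heading found.
    for match, next_match in zip(matches, matches[1:] + [None]):
        end = next_match.start() if next_match else len(clean_text)
        sections[match.group(1)] = clean_text[match.end():end].strip()
    return sections


keywords = ["INDICATION", "TECHNIQUE", "COMPARISON", "FINDINGS", "IMPRESSION"]
sample = (
    "EXAMINATION: CHEST PA AND LAT INDICATION: F with new onset ascites "
    ": Chest PA and lateral COMPARISON: None FINDINGS: No effusion "
    "IMPRESSION: No acute cardiopulmonary process"
)
sections = extract_sections(sample, keywords)
print(sections)
```

A row could then be added with df.loc[len(df)] = [sections[kw] for kw in keywords], which keeps files with a missing heading in the DataFrame and leaves the corresponding cell empty. If a heading occurs more than once in a report, this sketch keeps only the last occurrence.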
