After a re.findall search, I get fewer or more rows than I should



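I have raw, unprocessed text from which I want to extract the patient's gender, but I end up with fewer or more rows than the input has. How should I handle this error?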

fil = data['transcription']
print(fil)

Output:

0       SUBJECTIVE:,  This 23-year-old white female pr...
1       PAST MEDICAL HISTORY:, He has difficulty climb...
2       HISTORY OF PRESENT ILLNESS: , I have seen ABC ...
3       2-D M-MODE: , ,1.  Left atrial enlargement wit...
4       1.  The left ventricular cavity size and wall ...
...                        
4994    HISTORY:,  I had the pleasure of meeting and e...
4995    ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...
4996    SUBJECTIVE: , This is a 42-year-old white fema...
4997    CHIEF COMPLAINT: , This 5-year-old male presen...
4998    HISTORY: , A 34-year-old male presents today s...
Name: transcription, Length: 4999, dtype: object

This is the code that extracts the gender from the text:

import re
gender_aux = []
for i in fil:
    try:
        gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", i) or [" "]
    except:
        gender_aux.append(' ')
        # pass
    gender_dict = {"male": ["gentleman", "man", "male", "boy", 'he'],
                   "female": ["lady", "female", "woman", "girl", 'she']}
    for g in gender:
        if g in gender_dict['male']:
            gender_aux.append('male')
            break
        elif g in gender_dict['female']:
            gender_aux.append('female')
            break
        else:
            gender_aux += [' ']
            break
print(len(gender_aux))
print(gender_aux)

If I remove the `or [" "]` part, I get 4967 entries; otherwise I end up with 5032. I should actually get 4999 entries in total, one per row.

Output:

4967 or 5032 #it should be 4999 when i do print(len(gender_aux))
['female', 'male', 'male', ' ', 'male', 'male', 'male', 'male', 'male', ' ', 'male'...]

I noticed that you are neither converting the text with lower() nor passing flags=re.IGNORECASE, which may affect your final counts.
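As a quick illustration of the case-sensitivity point, here is a minimal sketch. The sample string is made up for the example, and the pattern is shortened to two alternatives.

```python
import re

sample = "He has a cough."  # hypothetical sample text, not from the dataset

# Case-sensitive search: the lowercase pattern misses the capitalised "He".
case_sensitive = re.findall("she|he", sample)

# Either of these two approaches finds it.
ignore_case = re.findall("she|he", sample, flags=re.IGNORECASE)
lowered = re.findall("she|he", sample.lower())

print(case_sensitive)  # []
print(ignore_case)     # ['He']
print(lowered)         # ['he']
```

Note that with re.IGNORECASE the match keeps its original casing ('He'), so it would still need a .lower() before being looked up in gender_dict. Lowercasing the input first avoids that extra step.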

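The main problem is that when re.findall does not match any gender word in the string, it returns an empty list, so the for loop never runs and nothing is appended for that row. To avoid this, I check whether re.findall returned anything and, if not, simply append a blank string.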

import pandas as pd
import re
text = pd.Series([
    "SUBJECTIVE:,  This 23-year-old white female pr...",
    "PAST MEDICAL HISTORY:, He has difficulty climb...",
    "HISTORY OF PRESENT ILLNESS: , I have seen ABC ...",
    "2-D M-MODE: , ,1.  Left atrial enlargement wit...",
    "1.  The left ventricular cavity size and wall ...",
    "HISTORY:,  I had the pleasure of meeting and e...",
    "ADMITTING DIAGNOSIS: , Kawasaki disease.,DISCH...",
    "SUBJECTIVE: , This is a 42-year-old white fema...",
    "CHIEF COMPLAINT: , This 5-year-old male presen...",
    "HISTORY: , A 34-year-old male presents today s..."
])
gender_dict = {"male": ["gentleman", "man", "male", "boy", 'he'],
               "female": ["lady", "female", "woman", "girl", 'she']}
gender_aux = []
for line in text:
    gender = re.findall("female|gentleman|woman|lady|man|male|girl|boy|she|he", line.lower())
    if len(gender):
        for g in gender:
            if g in gender_dict['male']:
                gender_aux.append('male')
                break
            elif g in gender_dict['female']:
                gender_aux.append('female')
                break
    else:  # no gender match
        gender_aux.append(' ')
print(len(gender_aux))
print(gender_aux)

Output:

10
['female', 'male', ' ', ' ', 'male', 'male', ' ', ' ', 'male', 'male']
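Note that the entries at positions 4 and 5 in the output above come out as 'male' only because "he" matches inside the word "the". One possible refinement, sketched below under a few assumptions, is to anchor the pattern with `\b` word boundaries and to return exactly one label per row, including rows that are not strings (for example NaN, on which re.findall raises a TypeError). That way the result always has the same length as the input. The helper name `extract_gender` and the sample rows are hypothetical and not part of the original code.

```python
import re

# Word-boundary pattern so that "he" no longer matches inside "the" or "she".
GENDER_RE = re.compile(
    r"\b(female|woman|lady|girl|she|gentleman|man|male|boy|he)\b",
    re.IGNORECASE,
)
FEMALE = {"female", "woman", "lady", "girl", "she"}

def extract_gender(value):
    """Return 'male', 'female', or ' ' for one row (hypothetical helper)."""
    if not isinstance(value, str):   # e.g. NaN for a missing transcription
        return " "
    match = GENDER_RE.search(value)  # only the first gendered word decides
    if match is None:
        return " "
    return "female" if match.group(1).lower() in FEMALE else "male"

# Made-up sample rows, including a non-string value.
rows = [
    "This 23-year-old white female presents",
    "He has difficulty",
    "The left ventricular cavity size",
    float("nan"),
]
labels = [extract_gender(r) for r in rows]
print(len(labels), labels)  # 4 ['female', 'male', ' ', ' ']
```

On the DataFrame from the question, something like `data['transcription'].apply(extract_gender)` should then yield one label per row, since the function returns a value on every code path.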
