我有一个CSV格式的数据。
Patient_ID,Analyte_line
KYN059AQP,"[['Urea', 3.0, '3', ''], ['Creatinine', 3.0, '3', ''], ['Uric Acid', 3.0, '3', '']]"
KQT767JLU,"[['Total Protein', '', '6', ''], ['Albumin', '', '6', ''], ['Globulin', '', '4', ''], ['Total Bilirubin', '', '6', ''], ['Direct Bilirubin', '', '4', ''], ['Indirect Bilirubin', '', '4', ''], ['Alkaline Phosphatase', '', '4', ''], ['SGPT', '', '5', ''], ['SGOT', '', '5', ''], ['Gamma GT', '', '5', ''], ['AG Ratio', '', '4', '']]"
PWV009AGQ,"[['HGB', '', '18', ''], ['RBC', '', '1', ''], ['HCT', '', '2', ''], ['MCV', '', '3', ''], ['MCH', '', '3', ''], ['MCHC', '', '3', ''], ['RDWcv', '', '2', ''], ['RDWsd', '', '3', ''], ['WBC', '', '4', ''], ['NEU', '', '5', ''], ['LYM', '', '6', ''], ['MON', '', '', ''], ['BAS', '', '', ''], ['EO', '', '', ''], ['NEU%', '', '', ''], ['LYM%', '', '', ''], ['MON%', '', '', ''], ['EO%', '', '', ''], ['BAS%', '', '', ''], ['PLT', '', '170', ''], ['PCT', '', '3', ''], ['MPV', '', '', ''], ['PDWsd', '', '', ''], ['PDWcv', '', '', ''], ['ESR', '', '5', ''], ['GRA#', '', '', '']]"
PWV009AGQ,"[['Total Protein', '', '23', ''], ['Albumin', '', '2', ''], ['Globulin', '', '2', ''], ['Total Bilirubin', '', '2', ''], ['Direct Bilirubin', 2.0, '', ''], ['Indirect Bilirubin', 2.0, '', ''], ['Alkaline Phosphatase', '', '3', ''], ['SGPT', 1.0, '', ''], ['SGOT', '', '4', ''], ['Gamma GT', 33.0, '31', ''], ['AG Ratio', '', '2', '']]"
WUY523UZO,"[['HGB', '', '11', ''], ['RBC', '', '2', ''], ['HCT', '', '4', ''], ['MCV', '', '5', ''], ['MCH', '', '6', ''], ['MCHC', '', '6', ''], ['RDWcv', '', '7', ''], ['RDWsd', '', '8', ''], ['WBC', '', '9', ''], ['NEU', '', '9', ''], ['LYM', '', '1', ''], ['MON', '', '', ''], ['BAS', '', '', ''], ['EO', '', '', ''], ['NEU%', '', '', ''], ['LYM%', '', '', ''], ['MON%', '', '', ''], ['EO%', '', '', ''], ['BAS%', '', '', ''], ['PLT', '', '', ''], ['PCT', '', '', ''], ['MPV', '', '', ''], ['PDWsd', '', '', ''], ['PDWcv', '', '', ''], ['ESR', '', '', ''], ['GRA#', '', '', '']]"
ZMO679WDS,"[['Dengue Ig-G', '', '8', '']]"
TVZ695TUB,"[['Rapid Malaria', '', 'Negative', '']]"
我想用合适的格式设置:
<表类>Patient_ID 尿素 快速疟疾 血糖为 血红蛋白 tbody><<tr>KYN059AQP 3.0, 3 - 3 - KQT767JLU - - - - PWV009AGQ 4.0 - - 9 TVZ695TUB - - - - 表类>
建议您使用简单的列表而不是字符串列表的列表,因为它可能使读取数据框架中的数据变得困难,对于本例,我使用StringIO
,从pandas
中可以使用read_csv
作为逗号分隔的文件读取。
from io import StringIO
from ast import literal_eval
import pandas as pd
import numpy as np
data = StringIO("""
Patient_ID,Analyte_line
KYN059AQP,"[['Urea', 3.0, '3', ''], ['Creatinine', 3.0, '3', ''], ['Uric Acid', 3.0, '3', '']]"
KQT767JLU,"[['Total Protein', '', '6', ''], ['Albumin', '', '6', ''], ['Globulin', '', '4', ''], ['Total Bilirubin', '', '6', ''], ['Direct Bilirubin', '', '4', ''], ['Indirect Bilirubin', '', '4', ''], ['Alkaline Phosphatase', '', '4', ''], ['SGPT', '', '5', ''], ['SGOT', '', '5', ''], ['Gamma GT', '', '5', ''], ['AG Ratio', '', '4', '']]"
PWV009AGQ,"[['HGB', '', '18', ''], ['RBC', '', '1', ''], ['HCT', '', '2', ''], ['MCV', '', '3', ''], ['MCH', '', '3', ''], ['MCHC', '', '3', ''], ['RDWcv', '', '2', ''], ['RDWsd', '', '3', ''], ['WBC', '', '4', ''], ['NEU', '', '5', ''], ['LYM', '', '6', ''], ['MON', '', '', ''], ['BAS', '', '', ''], ['EO', '', '', ''], ['NEU%', '', '', ''], ['LYM%', '', '', ''], ['MON%', '', '', ''], ['EO%', '', '', ''], ['BAS%', '', '', ''], ['PLT', '', '170', ''], ['PCT', '', '3', ''], ['MPV', '', '', ''], ['PDWsd', '', '', ''], ['PDWcv', '', '', ''], ['ESR', '', '5', ''], ['GRA#', '', '', '']]"
PWV009AGQ,"[['Total Protein', '', '23', ''], ['Albumin', '', '2', ''], ['Globulin', '', '2', ''], ['Total Bilirubin', '', '2', ''], ['Direct Bilirubin', 2.0, '', ''], ['Indirect Bilirubin', 2.0, '', ''], ['Alkaline Phosphatase', '', '3', ''], ['SGPT', 1.0, '', ''], ['SGOT', '', '4', ''], ['Gamma GT', 33.0, '31', ''], ['AG Ratio', '', '2', '']]"
WUY523UZO,"[['HGB', '', '11', ''], ['RBC', '', '2', ''], ['HCT', '', '4', ''], ['MCV', '', '5', ''], ['MCH', '', '6', ''], ['MCHC', '', '6', ''], ['RDWcv', '', '7', ''], ['RDWsd', '', '8', ''], ['WBC', '', '9', ''], ['NEU', '', '9', ''], ['LYM', '', '1', ''], ['MON', '', '', ''], ['BAS', '', '', ''], ['EO', '', '', ''], ['NEU%', '', '', ''], ['LYM%', '', '', ''], ['MON%', '', '', ''], ['EO%', '', '', ''], ['BAS%', '', '', ''], ['PLT', '', '', ''], ['PCT', '', '', ''], ['MPV', '', '', ''], ['PDWsd', '', '', ''], ['PDWcv', '', '', ''], ['ESR', '', '', ''], ['GRA#', '', '', '']]"
ZMO679WDS,"[['Dengue Ig-G', '', '8', '']]"
TVZ695TUB,"[['Rapid Malaria', '', 'Negative', '']]"
""")
df = pd.read_csv(data)
df['Analyte_line'] = df['Analyte_line'].apply(lambda x: literal_eval(x))
for col_name in ['Patient_ID', 'Urea', 'Rapid Malaria', 'SGPT', 'HGB']:
df[col_name] = [np.nan for i in range(len(df['Analyte_line']))]
返回如下内容,现在列'Analyte_line'
由列表的列表(而不是字符串)组成,那么它可以确定其余的列
>>> df
Patient_ID Analyte_line
0 KYN059AQP [[Urea, 3.0, 3, ], [Creatinine, 3.0, 3, ], [Ur...
1 KQT767JLU [[Total Protein, , 6, ], [Albumin, , 6, ], [Gl...
2 PWV009AGQ [[HGB, , 18, ], [RBC, , 1, ], [HCT, , 2, ], [M...
3 PWV009AGQ [[Total Protein, , 23, ], [Albumin, , 2, ], [G...
4 WUY523UZO [[HGB, , 11, ], [RBC, , 2, ], [HCT, , 4, ], [M...
5 ZMO679WDS [[Dengue Ig-G, , 8, ]]
6 TVZ695TUB [[Rapid Malaria, , Negative, ]]
因此,通过迭代'Analyte_line'
列的值并检查有效的列名,我可以分配相应的值
for i in range(len(df['Analyte_line'])):
for df_arr in df['Analyte_line'][i]:
if df_arr[0] in ['Urea', 'Rapid Malaria', 'SGPT', 'HGB']:
df[df_arr[0]][i] = "".join(map(lambda x: str(x),df_arr[1:]))
>>> df
Patient_ID Urea Rapid Malaria SGPT HGB
0 KYN059AQP 3.03 NaN NaN NaN
1 KQT767JLU NaN NaN 5 NaN
2 PWV009AGQ NaN NaN NaN 18
3 PWV009AGQ NaN NaN 1.0 NaN
4 WUY523UZO NaN NaN NaN 11
5 ZMO679WDS NaN NaN NaN NaN
6 TVZ695TUB NaN Negative NaN NaN
建议使用正确的数据类型,以便您可以以较不复杂的方式创建数据框架