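My previous question did not get any correct answer:
How to optimize time while converting a list to a DataFrame?
Let me explain the example further.
To be more precise, assume the DataFrame is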
First Name  Last Name  Country  Address     Age  Age-Group  Photo1  Photo2  Phototype
Mark        Shelby     US       Petersburg  42   Adult      1.jpg   2.jpg   PP
Andy        Carnot     GE       Freiburg    16   Teen       1.jpg           PP
When converted to CSV, I want the output to look like this:
N,Mark,Shelby,US
AG,42,Adult
AD,Petersburg
PH,1.jpg,PP
PH,2.jpg,PP
N,Andy,Carnot,GE
AG,16,Teen
AD,Freiburg
PH,1.jpg,PP
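The tags PH, AG, AD and N are not mapped from any column; they can be arbitrary characters.
Looping over the list below, mapping each entry and converting the result to a DataFrame works fine, but it takes a long time on large datasets. The exact code for this process is in the previous question.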
[['N','First Name','Last Name', 'Country'],
['AG','Age','Age-Group'],
['AD','Address'],
['PH','Photo1','Phototype'],
['PH','Photo2','Phototype'],
]
Use:
print (df)
First Name Last Name Country Address Age Age-Group Photo1 Photo2
0 Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg
1 Andy Carnot GE Freiburg 16 Teen 1.jpg NaN
Phototype
0 PP
1 PP
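For anyone who wants to run the steps that follow, here is a minimal sketch that rebuilds the sample frame printed above. The values are copied from the question; using NaN for Andy's missing second photo is an assumption based on the printed output.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
# NaN marks the missing second photo in the second row.
df = pd.DataFrame({
    'First Name': ['Mark', 'Andy'],
    'Last Name': ['Shelby', 'Carnot'],
    'Country': ['US', 'GE'],
    'Address': ['Petersburg', 'Freiburg'],
    'Age': [42, 16],
    'Age-Group': ['Adult', 'Teen'],
    'Photo1': ['1.jpg', '1.jpg'],
    'Photo2': ['2.jpg', np.nan],
    'Phototype': ['PP', 'PP'],
})
print(df)
```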
First, define a dictionary keyed by the first value of each final list, with all column names given as strings:
d = {'N':['First Name','Last Name', 'Country'],
'AG':['Age','Age-Group'],
'AD':['Address'],
'PH':['Photo','Phototype']}
Then filter the DataFrame by the lists in the dictionary:
out = {k: df.loc[:, df.columns.str.startswith(tuple(v))] for k, v in d.items()}
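To see what the prefix filter does in isolation, here is a small sketch on a bare column index. The name `cols` and the reduced set of columns are illustrative only. Passing a tuple to `str.startswith` matches any of the prefixes, which is why the single prefix 'Photo' picks up Photo1 and Photo2 as well as Phototype.

```python
import pandas as pd

# Illustrative subset of the column names from the example frame.
cols = pd.Index(['First Name', 'Age', 'Age-Group',
                 'Photo1', 'Photo2', 'Phototype'])

# A tuple of prefixes: a column is kept if it starts with any of them.
mask = cols.str.startswith(('Photo', 'Phototype'))
selected = cols[mask].tolist()
print(selected)
```

Because this is prefix matching, a key such as 'Age' would also select any other column whose name begins with "Age".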
For PH, the format has to be changed with melt:
out['PH'] = (out['PH'].melt('Phototype',
value_name='Photo',
ignore_index=False)
.drop(columns='variable')[['Photo','Phototype']]
.dropna(subset=['Photo']))
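The reshaping can be checked on the photo columns alone. This is a minimal sketch; the names `ph` and `ph_long` are illustrative. With `ignore_index=False`, `melt` keeps the original row labels, so each photo row still knows which person it belongs to, and the missing Photo2 is dropped.

```python
import numpy as np
import pandas as pd

# Only the photo-related columns of the example frame.
ph = pd.DataFrame({'Photo1': ['1.jpg', '1.jpg'],
                   'Photo2': ['2.jpg', np.nan],
                   'Phototype': ['PP', 'PP']})

# One row per (person, photo); the original index is preserved.
ph_long = (ph.melt('Phototype', value_name='Photo', ignore_index=False)
             .drop(columns='variable')[['Photo', 'Phototype']]
             .dropna(subset=['Photo']))
print(ph_long)
```

Note that `drop(columns='variable')` is used here because the positional axis argument in `drop('variable', 1)` is no longer accepted by recent pandas versions.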
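Finally, give every piece the same column labels and join them with concat, sorted so that the rows of each person stay together: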
out = {k: v.set_axis(range(len(v.columns)), axis=1) for k, v in out.items()}
df = pd.concat(out).sort_index(level=1,sort_remaining=False).reset_index(level=0).fillna('')
print (df)
level_0 0 1 2
0 N Mark Shelby US
0 AG 42 Adult
0 AD Petersburg
0 PH 1.jpg PP
0 PH 2.jpg PP
1 N Andy Carnot GE
1 AG 16 Teen
1 AD Freiburg
1 PH 1.jpg PP
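The interleaving step can be illustrated on two tiny frames. This is a sketch under the assumption that sorting on the second index level alone keeps tied rows in their original order, which is what the output above shows; the frames `a` and `b` are made up for the illustration.

```python
import pandas as pd

# Two pieces that already share integer column labels.
a = pd.DataFrame({0: ['Mark', 'Andy'], 1: ['Shelby', 'Carnot']})
b = pd.DataFrame({0: [42, 16]})

# Concatenating a dict puts the keys in the first index level
# and the original row labels in the second.
stacked = pd.concat({'N': a, 'AG': b})

# Sort by the original row label only, so each person's rows
# are grouped and the key order N, AG is kept within a person.
ordered = stacked.sort_index(level=1, sort_remaining=False)
print(ordered)
```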
Last, create lists of different lengths by removing the empty strings:
fin = [x[x!= ''].tolist() for x in df.to_numpy() ]
print (fin)
[['N', 'Mark', 'Shelby', 'US'],
['AG', 42, 'Adult'],
['AD', 'Petersburg'],
['PH', '1.jpg', 'PP'],
['PH', '2.jpg', 'PP'],
['N', 'Andy', 'Carnot', 'GE'],
['AG', 16, 'Teen'],
['AD', 'Freiburg'],
['PH', '1.jpg', 'PP']]
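Since the goal in the question is a CSV file with rows of different lengths, one possible way to write the ragged lists is the standard `csv` module. This is a sketch, not part of the original solution: `fin` is shortened to the first four rows, and an in-memory buffer stands in for a real file.

```python
import csv
import io

# First rows of the result above; rows have different lengths.
fin = [['N', 'Mark', 'Shelby', 'US'],
       ['AG', 42, 'Adult'],
       ['AD', 'Petersburg'],
       ['PH', '1.jpg', 'PP']]

# csv.writer accepts rows of any length; swap the buffer for
# open('out.csv', 'w', newline='') to write a real file.
buf = io.StringIO()
csv.writer(buf, lineterminator='\n').writerows(fin)
text = buf.getvalue()
print(text)
```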
EDIT: To match Photo followed by digits, use a regular expression: replace startswith with contains, and join the list values with |, the regex OR:
d = {'N':['First Name','Last Name', 'Country'],
'AG':['Age','Age-Group'],
'AD':['Address'],
'PH':[r'Photo\d+','Phototype']}
out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out)
{'N': First Name Last Name Country
0 Mark Shelby US
1 Andy Carnot GE, 'AG': Age Age-Group
0 42 Adult
1 16 Teen, 'AD': Address
0 Petersburg
1 Freiburg, 'PH': Photo1 Photo2 Phototype
0 1.jpg 2.jpg PP
1 1.jpg NaN PP}
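The pattern that `'|'.join(v)` builds for the PH key can be inspected directly. This is a minimal sketch on an illustrative column index `cols`.

```python
import pandas as pd

# Illustrative subset of the column names.
cols = pd.Index(['Photo1', 'Photo2', 'Phototype', 'Address'])

# Join the alternatives with the regex OR operator.
pattern = '|'.join([r'Photo\d+', 'Phototype'])
print(pattern)

# contains() keeps a column if the pattern matches anywhere in its name.
selected = cols[cols.str.contains(pattern)].tolist()
print(selected)
```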
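EDIT: The trick is to add ^ to the start and $ to the end of each string so that only exact matches are selected; the \d+ pattern is then what lets Photo + "number" match correctly: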
print (df)
First Name Last Name Country Address Age Age-Group Photo1 Photo2
0 Mark Shelby US Petersburg 42 Adult 1.jpg 2.jpg
1 Andy Carnot GE Freiburg 16 Teen 1.jpg NaN
Phototype Age Detail Address Detail
0 PP Young Far
1 PP Too Young Near
d = {'N':['First Name','Last Name', 'Country'],
'AG':['Age','Age-Group'],
'AD':['Address'],
'PH':[r'Photo\d+','Phototype']}
d = {k: [rf'^{x}$' for x in v] for k, v in d.items()}
print (d)
{'N': ['^First Name$', '^Last Name$', '^Country$'],
'AG': ['^Age$', '^Age-Group$'],
'AD': ['^Address$'],
'PH': ['^Photo\d+$', '^Phototype$']}
out = {k: df.loc[:, df.columns.str.contains('|'.join(v))] for k, v in d.items()}
print (out['AG'])
Age Age-Group
0 42 Adult
1 16 Teen
print (out['AD'])
Address
0 Petersburg
1 Freiburg
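The effect of the anchors is easiest to see by comparing both patterns on the column names that caused trouble. This is a small sketch; `cols`, `loose` and `strict` are illustrative names.

```python
import pandas as pd

# Columns where a plain substring match selects too much.
cols = pd.Index(['Age', 'Age-Group', 'Age Detail',
                 'Address', 'Address Detail'])

# Without anchors, 'Age' also matches inside 'Age Detail'.
loose = cols[cols.str.contains('Age|Age-Group')].tolist()

# With ^ and $ around each alternative, only exact names match.
strict = cols[cols.str.contains('^Age$|^Age-Group$')].tolist()

print(loose)
print(strict)
```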