我遇到了一个不适合我的循环。我想要的是从基于最终数据帧的条件的数据帧中提取值。
我有:
最终数据帧:
final = {'code': ['A001','A002','A003'],
'reg': ['2234','3432', '6578'],
'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
'df2_code': ['','',''],
'df2_name': ['', '', ''],
'df3_code': ['','',''],
'df3_name': ['', '', '']}
必须填充此数据帧。具体地,前缀为df2、df3、…的列,。。。
它必须填充其他数据帧的"code"one_answers"name"列,这些数据帧包含"最终数据帧"的相同的前三个列名(code、reg、name(。条件适用于填充,两个数据帧中的"reg"编号必须相同。
其他例子:
df2 = {'code': ['P001','A002','P003'],
'reg': ['2234','3432', '9978'],
'name': ['Chips 23', 'Flying 23', 'American99']}
所以,到目前为止,这个逻辑的产物是:
final = {'code': ['A001','A002','A003'],
'reg': ['2234','3432', '6578'],
'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
'df2_code': ['P001','A002',''],
'df2_name': ['Chips 23', 'Flying 23', '']}
但是,问题有点复杂。df2中存在重复的"reg"编号,用作条件。所以"df2"实际上是:
df2 = {'code': ['P001','A002','P003', 'B004'],
'reg': ['2234','3432', '9978', '2234'],
'name': ['Chips 23', 'Flying 23', 'American99', NaN]}
这必须通过在同一单元格中添加两者的"代码"one_answers"名称"来考虑。产品将是:
final = {'code': ['A001','A002','A003'],
'reg': ['2234','3432', '6578'],
'name': ['Solutions BS', 'Flying 23', 'Fast Co'],
'df2_code': ['P001&B004','A002',''],
'df2_name': ['Chips 23', 'Flying 23', '']}
到目前为止,我只为一个数据帧(df2(编写了这段代码,因为最终的df有200000多行,所以需要花费太多时间(我有5个df要扫描,但这些更小(:
for i, row in final.iterrows():
for j, inrow in df2.iterrows():
if row['reg'] == inrow['reg']:
if final['df2_code'].iloc[i] == '':
final['df2_code'].iloc[i] = str(inrow['code'])
else:
final['df2_code'].iloc[i] += '&' + str(inrow['code'])
if inrow['name'] is None:
continue
else:
if final['df2_name'].iloc[i] == '':
final['df2_name'].iloc[i] = str(inrow['name'])
else:
final['df2_name'].iloc[i] += '&' + str(inrow['name'])
考虑Series.cat
+groupby
:
df2 = pd.DataFrame({'code': ['P001','A002','P003', 'B004'],
'reg': ['2234','3432', '9978', '2234'],
'name': ['Chips 23', 'Flying 23', 'American99', float('nan')]})
agg_df = (df2.assign(name = lambda x: x["name"].fillna(""))
.groupby(['reg'])
.agg({'code': lambda g: g.str.cat(sep="&"),
'name': 'max'})
.add_prefix("df2_")
)
agg_df
# df2_code df2_name
# reg
# 2234 P001&B004 Chips 23
# 3432 A002 Flying 23
# 9978 P003 American99
要在数据帧之间迭代,请从数据帧列表运行水平合并(在列表理解中使用元素zip
作为前缀(。
# LIST OF AGGREGATED DATA FRAMES
dfs = [
(df.assign(name = lambda x: x["name"].fillna(""))
.groupby(['reg'])
.agg({'code': lambda g: g.fillna("").str.cat(sep="&"),
'name': max})
.add_prefix(f"{nm}_")
)
for df, nm
in zip([df2, df3, df4, df5], ["df2", "df3", "df4", "df5"])
]
# HORIZONTAL MERGE ON "reg"
final_df = pd.concat(dfs, axis=1)