我正在尝试合并下面的2个数据帧,以获得一个输出,其中每个代码都在每个日期列出,如果代码在该日期不在原始数据帧中,则数量将填充为0。我在下面举了一个我的输入和期望输出的例子,但我的实时数据将有超过一年的日期和超过20000个代码。
输入数据:
df1
date
0 2021-05-03
1 2021-05-04
2 2021-05-05
3 2021-05-06
4 2021-05-07
5 2021-05-08
6 2021-05-09
7 2021-05-10
df2
date code qty
0 2021-05-03 A 2
1 2021-05-06 A 5
2 2021-05-07 A 4
3 2021-05-08 A 5
4 2021-05-10 A 6
5 2021-05-04 B 1
6 2021-05-08 B 4
期望输出:
date code qty
03/05/2021 A 2
03/05/2021 B 0
04/05/2021 A 0
04/05/2021 B 1
05/05/2021 A 0
05/05/2021 B 0
06/05/2021 A 5
06/05/2021 B 0
07/05/2021 A 4
07/05/2021 B 0
08/05/2021 A 5
08/05/2021 B 4
09/05/2021 A 0
09/05/2021 B 0
10/05/2021 A 6
10/05/2021 B 0
我已经尝试了下面的合并,但我得到的输出似乎没有达到预期:
df_new = df1.merge(df2, how='left', on='date')
date code qty
0 2021-05-03 A 2.0
1 2021-05-04 B 1.0
2 2021-05-05 NaN NaN
3 2021-05-06 A 5.0
4 2021-05-07 A 4.0
5 2021-05-08 A 5.0
6 2021-05-08 B 4.0
7 2021-05-09 NaN NaN
8 2021-05-10 A 6.0
这更适合reindex
。创建所有组合,设置索引,重新索引到所有这些组合,填充,然后重置索引。
import pandas as pd
idx = pd.MultiIndex.from_product([df1.date, df2['code'].unique()],
names=['date', 'code'])
df2 = (df2.set_index(['date', 'code'])
.reindex(idx)
.fillna(0, downcast='infer')
.reset_index())
date code qty
0 2021-05-03 A 2
1 2021-05-03 B 0
2 2021-05-04 A 0
3 2021-05-04 B 1
4 2021-05-05 A 0
5 2021-05-05 B 0
6 2021-05-06 A 5
7 2021-05-06 B 0
8 2021-05-07 A 4
9 2021-05-07 B 0
10 2021-05-08 A 5
11 2021-05-08 B 4
12 2021-05-09 A 0
13 2021-05-09 B 0
14 2021-05-10 A 6
15 2021-05-10 B 0
带有pivot
和stack
的一个选项:
(df2.pivot_table(index='date', columns='code', fill_value=0)
.reindex(df1.date, fill_value=0)
.stack('code')
.reset_index()
)
输出:
date code qty
0 2021-05-03 A 2
1 2021-05-03 B 0
2 2021-05-04 A 0
3 2021-05-04 B 1
4 2021-05-05 A 0
5 2021-05-05 B 0
6 2021-05-06 A 5
7 2021-05-06 B 0
8 2021-05-07 A 4
9 2021-05-07 B 0
10 2021-05-08 A 5
11 2021-05-08 B 4
12 2021-05-09 A 0
13 2021-05-09 B 0
14 2021-05-10 A 6
15 2021-05-10 B 0
在code
的df1
和unique vals
之间执行cross-join
。然后使用df.fillna()
:
In [480]: x = pd.DataFrame(df2.code.unique())
In [483]: y = df1.assign(key=1).merge(x.assign(key=1), on='key').drop('key', 1).rename(columns={0: 'code'})
In [486]: res = y.merge(df2, how='left').fillna(0)
In [487]: res
Out[487]:
date code qty
0 2021-05-03 A 2.0
1 2021-05-03 B 0.0
2 2021-05-04 A 0.0
3 2021-05-04 B 1.0
4 2021-05-05 A 0.0
5 2021-05-05 B 0.0
6 2021-05-06 A 5.0
7 2021-05-06 B 0.0
8 2021-05-07 A 4.0
9 2021-05-07 B 0.0
10 2021-05-08 A 5.0
11 2021-05-08 B 4.0
12 2021-05-09 A 0.0
13 2021-05-09 B 0.0
14 2021-05-10 A 6.0
15 2021-05-10 B 0.0