Merge multiple DataFrames, keeping all columns and filling the rest with NaN



I have a set of DataFrames df1, df2, ... dfn.

Each of them looks like this:

id   |    date    | metric_value
001  | 2013-01-01 |     0.73
001  | 2013-03-01 |     0.73
002  | 2013-01-01 |     0.73
002  | 2013-02-01 |     0.73

But the id and date columns do not necessarily match between them, so I can have a df1 like:

id   |    date    | metric_value1
001  | 2013-01-01 |     0.73
001  | 2013-03-01 |     0.73
002  | 2013-01-01 |     0.73
002  | 2013-02-01 |     0.73
004  | 2013-03-01 |     0.73

And a df2 like:

id   |    date    | metric_value2
001  | 2013-01-01 |     0.72
003  | 2013-02-01 |     0.72
003  | 2013-03-01 |     0.72
004  | 2013-01-01 |     0.72

How can I merge df1 and df2 (and, in general, df1 ... dfn) so that I end up with something like this:

id   |    date    | metric_value1  | metric_value2
001  | 2013-01-01 |     0.73       |       0.72
001  | 2013-02-01 |      NaN       |       NaN
001  | 2013-03-01 |     0.73       |       NaN
002  | 2013-01-01 |     0.73       |       NaN
002  | 2013-02-01 |     0.73       |       NaN
002  | 2013-03-01 |      NaN       |       NaN
003  | 2013-01-01 |      NaN       |       NaN
003  | 2013-02-01 |      NaN       |       0.72
003  | 2013-03-01 |      NaN       |       0.72
004  | 2013-01-01 |      NaN       |       0.72
004  | 2013-02-01 |      NaN       |       NaN
004  | 2013-03-01 |     0.73       |       NaN

That is, every id should appear for every date in the whole range, from the minimum date to the maximum date.

Building on @JonathanLeon's solution:

import io
import pandas as pd
data='''id|date|metric_value1
001|2013-01-01|0.73
001|2013-03-01|0.73
002|2013-01-01|0.73
002|2013-02-01|0.73
004|2013-03-01|0.73'''
df1 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
data='''id|date|metric_value2
001|2013-01-01|0.72
003|2013-02-01|0.72
003|2013-03-01|0.72
004|2013-01-01|0.72'''
df2 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
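# the outer merge keeps every (id, date) pair found in either frame
df_out = df1.merge(df2, on=['id', 'date'], how='outer')
df_out['date'] = pd.to_datetime(df_out['date'])
# reindex to the full id x date product so missing pairs show up as NaN rows
(df_out.set_index(['id', 'date'])
       .reindex(pd.MultiIndex.from_product([df_out['id'].unique(),
                                            df_out['date'].unique()],
                                           names=['id', 'date']))
       .sort_index()
       .reset_index())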

Output:

id       date  metric_value1  metric_value2
0    1 2013-01-01           0.73           0.72
1    1 2013-02-01            NaN            NaN
2    1 2013-03-01           0.73            NaN
3    2 2013-01-01           0.73            NaN
4    2 2013-02-01           0.73            NaN
5    2 2013-03-01            NaN            NaN
6    3 2013-01-01            NaN            NaN
7    3 2013-02-01            NaN           0.72
8    3 2013-03-01            NaN           0.72
9    4 2013-01-01            NaN           0.72
10   4 2013-02-01            NaN            NaN
11   4 2013-03-01           0.73            NaN
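
The snippet above handles two frames and only uses dates that actually occur in the data. Here is a minimal sketch of how it might be generalised to df1 ... dfn with a date grid running from the smallest to the largest date. The helper names `merge_on_full_grid` and `read`, and the month-start frequency `freq='MS'`, are my own assumptions chosen to fit the sample data; they are not part of the original answer.

```python
from functools import reduce
import io

import pandas as pd


def merge_on_full_grid(frames, freq='MS'):
    """Outer-merge all frames on id/date, then reindex to every id x date pair.

    Assumes each (id, date) pair occurs at most once per frame; duplicated
    pairs would make the reindex step fail.
    """
    merged = reduce(
        lambda left, right: left.merge(right, on=['id', 'date'], how='outer'),
        frames,
    )
    merged['date'] = pd.to_datetime(merged['date'])
    # every period between the earliest and latest date, not just observed ones
    full_dates = pd.date_range(merged['date'].min(), merged['date'].max(), freq=freq)
    grid = pd.MultiIndex.from_product(
        [sorted(merged['id'].unique()), full_dates], names=['id', 'date']
    )
    return merged.set_index(['id', 'date']).reindex(grid).reset_index()


def read(text):
    # hypothetical helper: parse the pipe-separated sample data
    return pd.read_csv(io.StringIO(text), sep='|')


df1 = read('''id|date|metric_value1
001|2013-01-01|0.73
001|2013-03-01|0.73
002|2013-01-01|0.73
002|2013-02-01|0.73
004|2013-03-01|0.73''')
df2 = read('''id|date|metric_value2
001|2013-01-01|0.72
003|2013-02-01|0.72
003|2013-03-01|0.72
004|2013-01-01|0.72''')

result = merge_on_full_grid([df1, df2])
print(result)
```

Passing a longer list, e.g. `merge_on_full_grid([df1, df2, df3])`, should work the same way as long as each frame has its own metric column name.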

Try:

data='''id|date|metric_value1
001|2013-01-01|0.73
001|2013-03-01|0.73
002|2013-01-01|0.73
002|2013-02-01|0.73
004|2013-03-01|0.73'''
df1 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
data='''id|date|metric_value2
001|2013-01-01|0.72
003|2013-02-01|0.72
003|2013-03-01|0.72
004|2013-01-01|0.72'''
df2 = pd.read_csv(io.StringIO(data), sep='|', engine='python')
df1.merge(df2, on=['id', 'date'], how='outer')

Output:

id        date  metric_value1  metric_value2
0   1  2013-01-01          0.730          0.720
1   1  2013-03-01          0.730            NaN
2   2  2013-01-01          0.730            NaN
3   2  2013-02-01          0.730            NaN
4   4  2013-03-01          0.730            NaN
5   3  2013-02-01            NaN          0.720
6   3  2013-03-01            NaN          0.720
7   4  2013-01-01            NaN          0.720
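
Note that the ids print as 1, 2, 3, 4 rather than 001, 002, ... because `read_csv` parses that column as integers. If the zero-padded form matters, one option (a small sketch on a shortened version of the sample data) is to read the column as a string via the `dtype` argument, and to let `parse_dates` convert the date column at the same time:

```python
import io

import pandas as pd

data = '''id|date|metric_value1
001|2013-01-01|0.73
004|2013-03-01|0.73'''

# dtype={'id': str} keeps the leading zeros; parse_dates turns 'date' into datetimes
df = pd.read_csv(io.StringIO(data), sep='|', dtype={'id': str}, parse_dates=['date'])
print(df['id'].tolist())
```

The merge itself is unchanged, but every frame then needs its id column read the same way so the key types agree.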

Alternatively, build the full grid of ids and dates first, then left-merge each DataFrame onto it:

import pandas

# assumes df1 and df2 exist as built above; the date columns must be real datetimes
df1['date'] = pandas.to_datetime(df1['date'])
df2['date'] = pandas.to_datetime(df2['date'])
# build your list of unique ids
ids = pandas.DataFrame({'id': pandas.concat([df1['id'], df2['id']]).unique()})
# all dates from the earliest to the latest one ('MS' = month starts, as in the sample data)
all_dates = pandas.concat([df1['date'], df2['date']])
dates = pandas.DataFrame({'date': pandas.date_range(all_dates.min(), all_dates.max(), freq='MS')})
# use a cross merge to generate the cartesian product of all dates and all ids (pandas >= 1.2)
combinations = pandas.merge(left=dates, right=ids, how='cross')
# merge your dataframes onto the full grid on your 'key' columns
df3 = pandas.merge(left=combinations, right=df1, on=['date', 'id'], how='left')
df4 = pandas.merge(left=combinations, right=df2, on=['date', 'id'], how='left')
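
The grid idea can also be carried through to a single result frame instead of separate df3 and df4. A minimal self-contained sketch, assuming pandas 1.2 or later for `how='cross'` and month-start dates as in the sample data; the inline frames are stand-ins for df1 and df2, and the names `frames`, `grid` and `result` are mine:

```python
import pandas as pd

df1 = pd.DataFrame({
    'id': [1, 1, 2, 2, 4],
    'date': pd.to_datetime(['2013-01-01', '2013-03-01', '2013-01-01',
                            '2013-02-01', '2013-03-01']),
    'metric_value1': 0.73,
})
df2 = pd.DataFrame({
    'id': [1, 3, 3, 4],
    'date': pd.to_datetime(['2013-01-01', '2013-02-01', '2013-03-01',
                            '2013-01-01']),
    'metric_value2': 0.72,
})
frames = [df1, df2]

# all ids seen in any frame, and every month start between min and max date
ids = pd.DataFrame({'id': sorted(pd.concat([f['id'] for f in frames]).unique())})
all_dates = pd.concat([f['date'] for f in frames])
dates = pd.DataFrame({'date': pd.date_range(all_dates.min(), all_dates.max(), freq='MS')})

# cartesian product of ids and dates, then left-merge each frame onto it
grid = ids.merge(dates, how='cross')
result = grid
for frame in frames:
    result = result.merge(frame, on=['id', 'date'], how='left')
print(result)
```

Because every merge is a left join onto the same grid, the row count stays at (number of ids) x (number of dates) no matter how many frames are in the list, provided no frame repeats an (id, date) pair.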
