My data looks like this:
date, cola, colb, colc
1,10,,
2,11,,
3,12,,
4,13,,
1,,14,
2,,15,
3,,16,
4,,17,
1,,,17
2,,,18
3,,,19
4,13,,20
I want to merge the rows based on the first column so that the output looks like this:
date, cola, colb, colc
1,10,14,17
2,11,15,18
3,12,16,19
4,13,17,20
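I can't guarantee that there won't be any conflicts, so I'd like to be able to choose either the maximum or the mean.
You can use groupby. Starting from a CSV that contains duplicates: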
>>> !cat tomerge.csv
date, cola, colb, colc
1,10,,
2,11,,
1,,14,
2,,15,
1,,24,
2,,40,
1,,,17
2,,,18
Read it in:
>>> import pandas as pd
>>> df = pd.read_csv("tomerge.csv")
>>> df
date cola colb colc
0 1 10 NaN NaN
1 2 11 NaN NaN
2 1 NaN 14 NaN
3 2 NaN 15 NaN
4 1 NaN 24 NaN
5 2 NaN 40 NaN
6 1 NaN NaN 17
7 2 NaN NaN 18
And then the magic happens:
>>> df.groupby("date").mean()
cola colb colc
date
1 10 19.0 17
2 11 27.5 18
>>> df.groupby("date").max()
cola colb colc
date
1 10 24 17
2 11 40 18
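If you want a different conflict rule for each column, groupby can also take a per-column mapping through agg. Below is a minimal sketch using the data from the question, inlined with io.StringIO so it runs on its own. The choice of max for cola and colc and mean for colb is only an illustrative assumption, and skipinitialspace=True is added here so the header names come out as "cola" rather than " cola".

```python
import io
import pandas as pd

# The question's sample data, inlined so the sketch is self-contained.
csv_text = """date, cola, colb, colc
1,10,,
2,11,,
3,12,,
4,13,,
1,,14,
2,,15,
3,,16,
4,,17,
1,,,17
2,,,18
3,,,19
4,13,,20
"""

# skipinitialspace=True drops the blank that follows each comma in the header.
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)

# Illustrative rule per column; empty cells (NaN) are skipped by both max and mean.
rules = {"cola": "max", "colb": "mean", "colc": "max"}
merged = df.groupby("date").agg(rules).reset_index()
print(merged)
```

Because the empty cells are read as NaN, the columns come back as floats; cast them with astype(int) afterwards if you need integers.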