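I have a large data table that contains many duplicate entries. The goal is to complement these duplicates with one another so that I end up with rows carrying as much information as possible. Imagine a pandas DataFrame with the following structure: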
>>> import pandas as pd
>>> import numpy as np
>>> d = {'id': [1,1,1,2,2,3], 'col1': [25,np.nan,25,3,np.nan,1], 'col2':[np.nan,'rrt',np.nan,np.nan,'sdf',np.nan], 'col3':[300,300,np.nan,500,500,600]}
>>> df = pd.DataFrame(data=d)
>>> df
   id  col1 col2   col3
0   1  25.0  NaN  300.0
1   1   NaN  rrt  300.0
2   1  25.0  NaN    NaN
3   2   3.0  NaN  500.0
4   2   NaN  sdf  500.0
5   3   1.0  NaN  600.0
The result I would like to get is:
   id  col1 col2   col3
0   1  25.0  rrt  300.0
1   1  25.0  NaN  300.0
2   1   NaN  rrt  300.0
3   1  25.0  NaN    NaN
4   2   3.0  sdf  500.0
5   2   3.0  NaN  500.0
6   2   NaN  sdf  500.0
7   3   1.0  NaN  600.0
(Afterwards, I can use the solution from the thread "How do I select the group with the fewest null values in a groupby?" to obtain:
   id  col1 col2   col3
0   1  25.0  rrt  300.0
1   2   3.0  sdf  500.0
3   3   1.0  NaN  600.0
)

You can use the following. I hope it helps:
import pandas as pd
import numpy as np

d = {'id': [1,1,1,2,2,3], 'col1': [25,np.nan,25,3,np.nan,1], 'col2':[np.nan,'rrt',np.nan,np.nan,'sdf',np.nan], 'col3':[300,300,np.nan,500,500,600]}
df1 = pd.DataFrame(data=d)

# One row per id, holding the maximum of each numeric column
df2 = df1.groupby('id').agg({'col1': 'max', 'col3': 'max'}).reset_index()

# For each id, take the first non-null col2 value (NaN if the group has none)
col = []
for group_id in df2['id']:
    group_list = df1.loc[df1['id'] == group_id, 'col2'].tolist()
    new_list = [item for item in group_list if not pd.isnull(item)]
    if len(new_list) < 1:
        new_list.append(np.nan)
    col.append(new_list[0])
df2['col2'] = col

# Append the fully populated merged rows to the original data and sort by id
final_df = (pd.concat([df1, df2.dropna()], axis=0)
            .sort_values(by=['id', 'col3'])
            .reset_index(drop=True))
The final DataFrame is:
   id  col1 col2   col3
0   1  25.0  NaN  300.0
1   1   NaN  rrt  300.0
2   1  25.0  rrt  300.0
3   1  25.0  NaN    NaN
4   2   3.0  NaN  500.0
5   2   NaN  sdf  500.0
6   2   3.0  sdf  500.0
7   3   1.0  NaN  600.0
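
The follow-up step mentioned in the question, keeping the row with the fewest nulls for each id, could then look roughly like this. It is only a minimal sketch: the frame is rebuilt literally from the table above so the snippet runs on its own, and the names `null_counts`, `best_idx` and `best_rows` are made up for illustration. It is not necessarily the approach used in the linked thread.

```python
import numpy as np
import pandas as pd

# The combined frame shown above, rebuilt literally so the sketch is self-contained
final_df = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2, 2, 2, 3],
    'col1': [25, np.nan, 25, 25, 3, np.nan, 3, 1],
    'col2': [np.nan, 'rrt', 'rrt', np.nan, np.nan, 'sdf', 'sdf', np.nan],
    'col3': [300, 300, 300, np.nan, 500, 500, 500, 600],
})

# Count the missing values in each row
null_counts = final_df.isna().sum(axis=1)

# For each id, idxmin() returns the index label of the row with the fewest nulls
# (the first such row if several rows tie)
best_idx = null_counts.groupby(final_df['id']).idxmin()

# Keep exactly those rows, one per id
best_rows = final_df.loc[best_idx].reset_index(drop=True)
print(best_rows)
```

If several rows of one id tie on the null count, this keeps whichever comes first, which may or may not be what you want for your data.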
The output of df2 is:
   id  col1   col3 col2
0   1  25.0  300.0  rrt
1   2   3.0  500.0  sdf
2   3   1.0  600.0  NaN
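
As a possible shortcut, the per-id merged rows can probably be built without the explicit loop, because `GroupBy.first()` takes the first non-null value of every column within each group. The sketch below assumes that "first non-null" is an acceptable rule for all columns. That differs from the `max` aggregation above whenever a group holds several distinct values, although on this sample data both give the same rows. The names `merged` and `combined` are made up for illustration.

```python
import numpy as np
import pandas as pd

d = {'id': [1, 1, 1, 2, 2, 3],
     'col1': [25, np.nan, 25, 3, np.nan, 1],
     'col2': [np.nan, 'rrt', np.nan, np.nan, 'sdf', np.nan],
     'col3': [300, 300, np.nan, 500, 500, 600]}
df1 = pd.DataFrame(data=d)

# First non-null value of each column per id; a column stays missing
# only if the whole group has no value for it
merged = df1.groupby('id').first().reset_index()

# As in the loop-based version, append only the fully populated merged rows
combined = (pd.concat([df1, merged.dropna()], axis=0)
            .sort_values(by=['id', 'col3'])
            .reset_index(drop=True))
print(merged)
print(combined)
```

Here the columns of `merged` keep their original order (id, col1, col2, col3), since no column is added afterwards.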