Combining duplicates in Pandas to get as much data as possible



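I have a large data table that contains many duplicates. The goal is to have the duplicates complement one another, so that I end up with the rows that carry the most information. Imagine a pandas DataFrame with the following structure: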

>>> import pandas as pd
>>> import numpy as np
>>> d = {'id': [1,1,1,2,2,3], 'col1': [25,np.nan,25,3,np.nan,1], 'col2':[np.nan,'rrt',np.nan,np.nan,'sdf',np.nan], 'col3':[300,300,np.nan,500,500,600]}
>>> df = pd.DataFrame(data=d)
>>> df
   id  col1 col2   col3
0   1  25.0  NaN  300.0
1   1   NaN  rrt  300.0
2   1  25.0  NaN    NaN
3   2   3.0  NaN  500.0
4   2   NaN  sdf  500.0
5   3   1.0  NaN  600.0

The result I would like to get is:

   id  col1 col2   col3
0   1  25.0  rrt  300.0
1   1  25.0  NaN  300.0
2   1   NaN  rrt  300.0
3   1  25.0  NaN    NaN
4   2   3.0  sdf  500.0
5   2   3.0  NaN  500.0
6   2   NaN  sdf  500.0
7   3   1.0  NaN  600.0

(Afterwards, I can use the solution from the thread "How do I select the group with the fewest nulls in a groupby?" to get:

   id  col1 col2   col3
0   1  25.0  rrt  300.0
1   2   3.0  sdf  500.0
3   3   1.0  NaN  600.0
)
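
The follow-up step mentioned above (keeping, for each id, the row with the fewest missing values) could be sketched roughly as follows. This is only an illustration under my own assumptions, not the code from the referenced thread: the names `expanded`, `null_counts`, `best_rows` and `most_complete` are hypothetical, and the input is the desired intermediate table typed in by hand.

```python
import numpy as np
import pandas as pd

# The desired intermediate table from above, typed in by hand
# (original rows plus the combined rows).
expanded = pd.DataFrame({
    'id':   [1, 1, 1, 1, 2, 2, 2, 3],
    'col1': [25, 25, np.nan, 25, 3, 3, np.nan, 1],
    'col2': ['rrt', np.nan, 'rrt', np.nan, 'sdf', np.nan, 'sdf', np.nan],
    'col3': [300, 300, 300, np.nan, 500, 500, 500, 600],
})

# Number of missing values in each row.
null_counts = expanded.isna().sum(axis=1)

# For each id, the index label of the row with the fewest missing values
# (the first such row if there is a tie).
best_rows = null_counts.groupby(expanded['id']).idxmin()

# Keep only those rows and renumber the index.
most_complete = expanded.loc[best_rows].reset_index(drop=True)
print(most_complete)
```

When several rows of one id tie on the number of missing values, `idxmin` keeps the first of them, so the row order of the input matters in that case.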

You can use the following. I hope it helps:

import pandas as pd
import numpy as np

d = {'id': [1,1,1,2,2,3], 'col1': [25,np.nan,25,3,np.nan,1], 'col2':[np.nan,'rrt',np.nan,np.nan,'sdf',np.nan], 'col3':[300,300,np.nan,500,500,600]}
df1 = pd.DataFrame(data=d)

# Per-id maximum of the numeric columns (max ignores NaN)
df2 = df1.groupby('id').agg({'col1': 'max', 'col3': 'max'}).reset_index()

# For each id, take the first non-null value of col2 (NaN if there is none)
col = []
for group_id in df2['id']:
    group_list = df1.groupby('id')['col2'].get_group(group_id).tolist()
    new_list = [item for item in group_list if not pd.isnull(item)]
    if len(new_list) < 1:
        new_list.append(np.nan)
    col.append(new_list[0])
df2['col2'] = col

# Append the fully populated combined rows to the original rows
final_df = pd.concat([df1, df2.dropna()], axis=0).sort_values(by=['id', 'col3']).reset_index(drop=True)

The final DataFrame is

   id  col1 col2   col3
0   1  25.0  NaN  300.0
1   1   NaN  rrt  300.0
2   1  25.0  rrt  300.0
3   1  25.0  NaN    NaN
4   2   3.0  NaN  500.0
5   2   NaN  sdf  500.0
6   2   3.0  sdf  500.0
7   3   1.0  NaN  600.0

The output of df2 is

   id  col1   col3 col2
0   1  25.0  300.0  rrt
1   2   3.0  500.0  sdf
2   3   1.0  600.0  NaN
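
As a possible shorter variant (a sketch, not the answer's method), the combined row per id can also be built with `GroupBy.first()`, which returns the first non-null value of each column within a group. Note that this swaps the per-column `max` for "first non-null"; the two agree on this sample because no id has conflicting non-null values. The names `df` and `combined` are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

d = {
    'id':   [1, 1, 1, 2, 2, 3],
    'col1': [25, np.nan, 25, 3, np.nan, 1],
    'col2': [np.nan, 'rrt', np.nan, np.nan, 'sdf', np.nan],
    'col3': [300, 300, np.nan, 500, 500, 600],
}
df = pd.DataFrame(data=d)

# One row per id, holding the first non-null value found in each column.
combined = df.groupby('id', as_index=False).first()

# Same final step as in the answer: append only the fully populated
# combined rows to the original rows, then sort and renumber.
final_df = (
    pd.concat([df, combined.dropna()], axis=0)
    .sort_values(by=['id', 'col3'])
    .reset_index(drop=True)
)
print(combined)
print(final_df)
```

If a group can hold different non-null values in the same column, `first()` silently keeps the earliest one, so an explicit aggregation such as `max` may be the safer choice for such data.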
