合并并附加数据帧,用NaN填充



我有两个数据帧要在'Id'上合并。这两个数据帧的大小不相同,并且Id可以在任一数据帧中重复。我希望在df1中为每个Id的第一个实例合并列,然后为不匹配的实例附加数据表。

df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 4, 4, 4],
"x": [3, 7, 8, 3, 5, 4, 1]})
df2 = pd.DataFrame({"Id": [1, 2, 2, 2, 3, 4, 4],
"y": [9, 7, 5, 8, 9, 6, 8]})

预期输出:

df1 = {"Id": [1, 1, 2, 2, 2, 3, 4, 4, 4],
"x": [3, 7, 8, NaN, NaN, 3, 5, 4, 1],
"y": [9, NaN, 7, 5, 8, 9, 6, 8, NaN]} 
}

编辑:将解决方案扩展到我的较大数据集中会导致错误:"float"对象不可迭代。

所有日期列都是datetime,UserId和ActivityCount是int64。

df = (
pd.concat(
[df_emails.groupby("UserId")["Email Dates"].agg(list), df_emails.groupby("UserId")["Click Dates"].agg(list), 
df_emails.groupby("UserId")["Open Dates"].agg(list), df_activity.groupby("UserId")["Activity Dates"].agg(list), 
df_activity.groupby("UserId")["ActivityCount"].agg(list)],
axis=1,
)
.reset_index()
.apply(
lambda x: [
(x["UserId"], *t) for t in zip_longest(x["Email Dates"], x["Click Dates"], x["Open Dates"], 
x["Activity Dates"], x["ActivityCount"], fillvalue=np.nan)
],
axis=1,
)
.explode()
.apply(pd.Series)
.rename(columns={0: "UserId", 1: "Email Dates", 2: "Click Dates", 3: "Open Dates", 4: "Activity Dates", 5: "ActivityCount"})
)

尝试:

from itertools import zip_longest

df = (
pd.concat(
[df1.groupby("Id")["x"].agg(list), df2.groupby("Id")["y"].agg(list)],
axis=1,
)
.fillna("")
.apply(list)
.reset_index()
.apply(
lambda x: [
(x["Id"], *t) for t in zip_longest(x["x"], x["y"], fillvalue=np.nan)
],
axis=1,
)
.explode()
.apply(pd.Series)
.rename(columns={0: "Id", 1: "x", 2: "y"})
)
print(df)

打印:

Id    x    y
0  1.0  3.0  9.0
0  1.0  7.0  NaN
1  2.0  8.0  7.0
1  2.0  NaN  5.0
1  2.0  NaN  8.0
2  3.0  3.0  9.0
3  4.0  5.0  6.0
3  4.0  4.0  8.0
3  4.0  1.0  NaN

最新更新