我有两个数据帧要在'Id'上合并。这两个数据帧的大小不相同,并且Id可以在任一数据帧中重复。我希望在df1中为每个Id的第一个实例合并列,然后为不匹配的实例附加数据表。
df1 = pd.DataFrame({"Id": [1, 1, 2, 3, 4, 4, 4],
"x": [3, 7, 8, 3, 5, 4, 1]})
df2 = pd.DataFrame({"Id": [1, 2, 2, 2, 3, 4, 4],
"y": [9, 7, 5, 8, 9, 6, 8]})
预期输出:
df1 = {"Id": [1, 1, 2, 2, 2, 3, 4, 4, 4],
"x": [3, 7, 8, NaN, NaN, 3, 5, 4, 1],
"y": [9, NaN, 7, 5, 8, 9, 6, 8, NaN]}
}
编辑:将解决方案扩展到我的较大数据集中会导致错误:"float"对象不可迭代。
所有日期列都是datetime,UserId和ActivityCount是int64。
df = (
pd.concat(
[df_emails.groupby("UserId")["Email Dates"].agg(list), df_emails.groupby("UserId")["Click Dates"].agg(list),
df_emails.groupby("UserId")["Open Dates"].agg(list), df_activity.groupby("UserId")["Activity Dates"].agg(list),
df_activity.groupby("UserId")["ActivityCount"].agg(list)],
axis=1,
)
.reset_index()
.apply(
lambda x: [
(x["UserId"], *t) for t in zip_longest(x["Email Dates"], x["Click Dates"], x["Open Dates"],
x["Activity Dates"], x["ActivityCount"], fillvalue=np.nan)
],
axis=1,
)
.explode()
.apply(pd.Series)
.rename(columns={0: "UserId", 1: "Email Dates", 2: "Click Dates", 3: "Open Dates", 4: "Activity Dates", 5: "ActivityCount"})
)
尝试:
from itertools import zip_longest
df = (
pd.concat(
[df1.groupby("Id")["x"].agg(list), df2.groupby("Id")["y"].agg(list)],
axis=1,
)
.fillna("")
.apply(list)
.reset_index()
.apply(
lambda x: [
(x["Id"], *t) for t in zip_longest(x["x"], x["y"], fillvalue=np.nan)
],
axis=1,
)
.explode()
.apply(pd.Series)
.rename(columns={0: "Id", 1: "x", 2: "y"})
)
print(df)
打印:
Id x y
0 1.0 3.0 9.0
0 1.0 7.0 NaN
1 2.0 8.0 7.0
1 2.0 NaN 5.0
1 2.0 NaN 8.0
2 3.0 3.0 9.0
3 4.0 5.0 6.0
3 4.0 4.0 8.0
3 4.0 1.0 NaN