Modify a DataFrame based on values in other rows



I am trying to reshape my DataFrame so that I can build a Sankey diagram from it.

I have 3 million rows like these:

client_id  |  start_date  |  end_date    |  position
1234       |  16-07-2019  |  27-03-2021  |  3
1234       |  18-07-2021  |  09-10-2021  |  1
1234       |  28-03-2021  |  17-07-2021  |  2
1234       |  10-10-2021  |  20-11-2021  |  2

I want it to look like this:

client_id  |  start_date  |  end_date    |  position  |  source  |  target
1234       |  16-07-2019  |  27-03-2021  |  3         |  3       |  2
1234       |  18-07-2021  |  09-10-2021  |  1         |  1       |  2
1234       |  28-03-2021  |  17-07-2021  |  2         |  2       |  1
1234       |  10-10-2021  |  20-11-2021  |  2         |  2       |  4

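The value 4 is the one I use as the "exit" node in the flow: it is the target of a client's last period.

I don't know how to do this.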

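Background: the source and target values are position values derived from start_date and end_date. For example, in the first row the source position is 3 and the target position is 2, because after that row's end date the client moved from position 3 to position 2.

Source and target are computed per client in date order, so you can sort each client's rows by date and look up the position of the row that follows.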

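import pandas as pd

columns = ["client_id", "start_date", "end_date", "position"]
data = [
    ["1234", "16-07-2019", "27-03-2021", 3],
    ["1234", "18-07-2021", "09-10-2021", 1],
    ["1234", "28-03-2021", "17-07-2021", 2],
    ["1234", "10-10-2021", "20-11-2021", 2],
    ["5678", "16-07-2019", "27-03-2021", 3],
    ["5678", "18-07-2021", "09-10-2021", 1],
    ["5678", "28-03-2021", "17-07-2021", 2],
    ["5678", "10-10-2021", "20-11-2021", 2],
]
df = pd.DataFrame(data, columns=columns)

# The dates are day-first (DD-MM-YYYY). Passing the format explicitly makes sure
# that "09-10-2021" is read as 9 October and not as 10 September.
df = df.assign(
    start_date=pd.to_datetime(df["start_date"], format="%d-%m-%Y"),
    end_date=pd.to_datetime(df["end_date"], format="%d-%m-%Y"),
)

# Rank each client's rows by start date (1 = earliest period).
# method="first" keeps the ranks whole numbers even if two start dates tie.
sdf = df.assign(
    rank=df.groupby("client_id")["start_date"].rank(method="first")
)
# The period that follows a row has rank + 1.
sdf = sdf.assign(next_rank=sdf["rank"] + 1)

# Self-join every row to the same client's next row to pick up its position.
# A client's last row has no match, so it gets the exit value 4.
combine_result = pd.merge(
    sdf,
    sdf[["client_id", "position", "rank"]],
    left_on=["client_id", "next_rank"],
    right_on=["client_id", "rank"],
    how="left",
    suffixes=("", "_next"),
).fillna({"position_next": 4})

# The left join introduces NaN, which turns the column into floats; cast it back.
combine_result = combine_result.astype({"position_next": int})

result = (
    combine_result[["client_id", "start_date", "end_date", "position", "position_next"]]
    .rename({"position": "source", "position_next": "target"}, axis=1)
    .sort_values(["client_id", "start_date"])
)
print(result)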
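The same "sort by date, then take the next position" idea can also be sketched without a self-join, by shifting each client's positions up by one row. This is only a minimal sketch, not a tuned solution: it assumes the dates are DD-MM-YYYY strings as in the question, that each client's periods do not share a start date, and that 4 is the exit value. The helper name `add_source_target` and the constant `EXIT_POSITION` are made up for illustration.

```python
import pandas as pd

EXIT_POSITION = 4  # exit marker from the question (assumed to be unused as a real position)


def add_source_target(df, exit_value=EXIT_POSITION):
    """Hypothetical helper: add source/target columns, keeping the original row order."""
    # Parse the day-first date strings only to get a correct chronological order.
    start = pd.to_datetime(df["start_date"], format="%d-%m-%Y")
    out = df.assign(_start=start).sort_values(["client_id", "_start"], kind="stable")

    # source is the row's own position.
    out["source"] = out["position"]
    # target is the position of the same client's next period;
    # the last period of each client has no successor and gets the exit value.
    out["target"] = (
        out.groupby("client_id")["position"]
        .shift(-1)
        .fillna(exit_value)
        .astype(int)
    )
    # Drop the helper column and restore the original row order.
    return out.drop(columns="_start").sort_index()


sample = pd.DataFrame(
    [
        ["1234", "16-07-2019", "27-03-2021", 3],
        ["1234", "18-07-2021", "09-10-2021", 1],
        ["1234", "28-03-2021", "17-07-2021", 2],
        ["1234", "10-10-2021", "20-11-2021", 2],
        ["5678", "16-07-2019", "27-03-2021", 3],
        ["5678", "18-07-2021", "09-10-2021", 1],
        ["5678", "28-03-2021", "17-07-2021", 2],
        ["5678", "10-10-2021", "20-11-2021", 2],
    ],
    columns=["client_id", "start_date", "end_date", "position"],
)
result = add_source_target(sample)
print(result)
```

Because `groupby(...).shift(-1)` works within each client, the last row of one client never picks up the first position of the next client. On 3 million rows this avoids building a second copy of the frame for the merge, though whether it is actually faster on your data is something to measure rather than assume.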
