pandas ' pd.熔化的多索引列使用



我在试图写出易懂的pandas时遇到了麻烦,这让我觉得我错过了一些功能或用法(可能是pd.melt方法)。

我有两个数据集要合并。两者相似:

  • time表示状态变化
  • nameinstance复合标识,用于唯一标识一个事物的记录条目。
  • 最后一个单一的值命名的状态已经改变了,在那个时候的东西。

所以我想要组合的每个数据集的一个示例记录将是:

  1. dict(time=0, name="a", instance=0, state=1)
  2. dict(time=5, name="a", instance=0, location="london")

我想将这两个记录集合并为一个,其中每个(name, instance)在每次都有最后已知的statelocation

[
dict(time=0, name="a", instance=0, state=1, location=np.nan),
dict(time=5, name="a", instance=0, state=1, location="london"),
]

为了达到这一点,我目前做pd.DataFrame.pivot_table,pd.DataFrame.ffill,pd.DataFrame.meltpd.DataFrame.reset_index的组合。它似乎按预期工作,但感觉非常麻烦/不可读,特别是一旦我进入使用pd.DataFrame.melt

我觉得我错过了pd.DataFrame.melt函数的一些用法,但我不太确定如何将文档应用于我正在使用的pd.MultiIndex列的数据集,或者如果我错过了其他一些pandas实用程序,我应该使用

如果我发现melt不是我应该使用的,我会用更合适的东西更新问题标题。

我有:

import pandas as pd
states = [
dict(time=0, name="a", instance=0, state=0),
dict(time=0, name="a", instance=1, state=0),
dict(time=0, name="a", instance=2, state=0),
dict(time=0, name="b", instance=1, state=0),
dict(time=0, name="b", instance=2, state=0),
dict(time=1, name="a", instance=1, state=1),
dict(time=2, name="a", instance=2, state=1),
dict(time=2, name="b", instance=1, state=1),
]
locations = [
dict(time=0, name="a", instance=0, location="tokyo"),
dict(time=0, name="a", instance=1, location="tokyo"),
dict(time=0, name="a", instance=2, location="tokyo"),
dict(time=0, name="b", instance=1, location="tokyo"),
dict(time=0, name="b", instance=2, location="tokyo"),
dict(time=1, name="a", instance=0, location="london"),
dict(time=1, name="a", instance=2, location="london"),
dict(time=1, name="b", instance=1, location="london"),
dict(time=1, name="b", instance=2, location="london"),
dict(time=1, name="a", instance=1, location="paris"),
dict(time=2, name="a", instance=2, location="paris"),
dict(time=2, name="b", instance=1, location="paris"),
]
states = pd.DataFrame.from_dict(states)
locations = pd.DataFrame.from_dict(locations)
combined = pd.concat([states, locations], axis="index")
combined = combined.pivot_table(
index="time",
columns=["name", "instance"],
values=["state", "location"],
aggfunc="last",
)
combined = combined.ffill()
ugly_melt = combined.melt(ignore_index=False)
ugly_melt = ugly_melt.rename(columns={None: "state_status"})
ugly_melt = (
ugly_melt.reset_index()
.pivot(
index=["time", "name", "instance"],
columns=["state_status"],
values="value",
)
.reset_index()
)
print(ugly_melt)

您可以简单地对两个数据集进行连接和过滤,记住time约束(只能连接以前/当前状态,而不能连接未来状态)。

这样就不需要执行前向填充ffill(),并且使用连接和过滤器比pd.melt方法更容易理解。

初始化代码

states = [
dict(time=0, name="a", instance=0, state=0),
dict(time=0, name="a", instance=1, state=0),
dict(time=0, name="a", instance=2, state=0),
dict(time=0, name="b", instance=1, state=0),
dict(time=0, name="b", instance=2, state=0),
dict(time=1, name="a", instance=1, state=1),
dict(time=2, name="a", instance=2, state=1),
dict(time=2, name="b", instance=1, state=1),
]
locations = [
dict(time=0, name="a", instance=0, location="tokyo"),
dict(time=0, name="a", instance=1, location="tokyo"),
dict(time=0, name="a", instance=2, location="tokyo"),
dict(time=0, name="b", instance=1, location="tokyo"),
dict(time=0, name="b", instance=2, location="tokyo"),
dict(time=1, name="a", instance=0, location="london"),
dict(time=1, name="a", instance=2, location="london"),
dict(time=1, name="b", instance=1, location="london"),
dict(time=1, name="b", instance=2, location="london"),
dict(time=1, name="a", instance=1, location="paris"),
dict(time=2, name="a", instance=2, location="paris"),
dict(time=2, name="b", instance=1, location="paris"),
]

实施

import pandas as pd
"""
Steps:
1. Convert to dataframe (Rename state time as state_time, keep location time as time)
2. Merge both dataframe together
3. Filter state time <= location time (since location uses current/previous state)
4. Filter for latest state time (since location must remember the latest state and not all previous states)
"""
# Step 1
states = pd.DataFrame(states).rename(columns={"time": "state_time"})
locations = pd.DataFrame(locations)
# Step 2
merged_df = pd.merge(locations, states, on=["name", "instance"])
# Step 3
merged_df = merged_df[merged_df["state_time"] <= merged_df["time"]]
# Step 4
merged_df = merged_df
.sort_values(["time", "name", "instance", "state_time"])
.drop_duplicates(["time", "name", "instance"], keep="last")
.reset_index(drop=True)
.drop(columns=["state_time"])

结果如下merged_df

time name  instance location  state
0      0    a         0    tokyo      0
1      0    a         1    tokyo      0
2      0    a         2    tokyo      0
3      0    b         1    tokyo      0
4      0    b         2    tokyo      0
5      1    a         0   london      0
6      1    a         1    paris      1
7      1    a         2   london      0
8      1    b         1   london      0
9      1    b         2   london      0
10     2    a         2    paris      1
11     2    b         1    paris      1

结果的长度遵循location数据,如果您希望每个名称-实例-位置都有一个时间,您可以事先进行外部连接。

注意combinedffill之后是这样的。

location                                state                    
name            a                      b             a              b     
instance        0      1       2       1       2     0    1    2    1    2
time                                                                      
0           tokyo  tokyo   tokyo   tokyo   tokyo   0.0  0.0  0.0  0.0  0.0
1          london  paris  london  london  london   0.0  1.0  0.0  0.0  0.0
2          london  paris   paris   paris  london   0.0  1.0  1.0  1.0  0.0

有2列,locationstate。您可以分别meltmerge

melt1 = pd.melt(combined["location"], value_name="location", ignore_index=False).reset_index()
melt2 = pd.melt(combined["state"], value_name="state", ignore_index=False).reset_index()
better_melt = melt1.merge(melt2, on=["time", "name", "instance"])
better_melt
time name  instance location  state
0      0    a         0    tokyo    0.0
1      1    a         0   london    0.0
2      2    a         0   london    0.0
3      0    a         1    tokyo    0.0
4      1    a         1    paris    1.0
5      2    a         1    paris    1.0
6      0    a         2    tokyo    0.0
7      1    a         2   london    0.0
8      2    a         2    paris    1.0
9      0    b         1    tokyo    0.0
10     1    b         1   london    0.0
11     2    b         1    paris    1.0
12     0    b         2    tokyo    0.0
13     1    b         2   london    0.0
14     2    b         2   london    0.0

数据集:

import pandas as pd
states = [
dict(time=0, name="a", instance=0, state=0),
dict(time=0, name="a", instance=1, state=0),
dict(time=0, name="a", instance=2, state=0),
dict(time=0, name="b", instance=1, state=0),
dict(time=0, name="b", instance=2, state=0),
dict(time=1, name="a", instance=1, state=1),
dict(time=2, name="a", instance=2, state=1),
dict(time=2, name="b", instance=1, state=1),
]
locations = [
dict(time=0, name="a", instance=0, location="tokyo"),
dict(time=0, name="a", instance=1, location="tokyo"),
dict(time=0, name="a", instance=2, location="tokyo"),
dict(time=0, name="b", instance=1, location="tokyo"),
dict(time=0, name="b", instance=2, location="tokyo"),
dict(time=1, name="a", instance=0, location="london"),
dict(time=1, name="a", instance=2, location="london"),
dict(time=1, name="b", instance=1, location="london"),
dict(time=1, name="b", instance=2, location="london"),
dict(time=1, name="a", instance=1, location="paris"),
dict(time=2, name="a", instance=2, location="paris"),
dict(time=2, name="b", instance=1, location="paris"),
]

似乎时间,名称和实例是你用来索引数据的级别。使用set_index将它们添加到索引中是有意义的:

states = (
pd.DataFrame.from_dict(states)
.set_index(["time", "name", "instance"])
)
locations = (
pd.DataFrame.from_dict(locations)
.set_index(["time", "name", "instance"])
)

一旦MultiIndex就位,您就可以沿着列连接状态和位置。这将在状态列中留下一些nan。首先对索引进行排序,以便首先按时间排列条目,然后根据相同的名称和实例对数据进行分组,最后执行前向填充。

combined = (
pd.concat([states, locations], axis=1)
.sort_index()
.groupby(["name", "instance"])
.ffill()
)

结果与你的略有不同。我得到:

location state
time name instance      
0    a    0        tokyo    0.0
1        tokyo    0.0
2        tokyo    0.0
b    1        tokyo    0.0
2        tokyo    0.0
1    a    0        london   0.0
1        paris    1.0
2        london   0.0
b    1        london   0.0
2        london   0.0
2    a    2        paris    1.0
b    1        paris    1.0

结果:

state_status time name instance location state
0            0    a    0        tokyo    0.0
1            0    a    1        tokyo    0.0
2            0    a    2        tokyo    0.0
3            0    b    1        tokyo    0.0
4            0    b    2        tokyo    0.0
5            1    a    0        london   0.0
6            1    a    1        paris    1.0
7            1    a    2        london   0.0
8            1    b    1        london   0.0
9            1    b    2        london   0.0
10           2    a    0        london   0.0
11           2    a    1        paris    1.0
12           2    a    2        paris    1.0
13           2    b    1        paris    1.0
14           2    b    2        london   0.0

首先,我得到一个MultiIndex数据框架。如果你不喜欢,只要reset_index()。最重要的是,您获得了我没有的额外行,例如当(time, name, instance) = (2, a, 0)时。所有输入数据都没有这种值组合,所以这就是为什么它没有出现在我的结果中。这是你的,因为pivot_table是如何工作的。这可能是一种可取的行为,也可能不是,由你来决定。

相关内容

最新更新