我在试图写出易懂的pandas
时遇到了麻烦,这让我觉得我错过了一些功能或用法(可能是pd.melt
方法)。
我有两个数据集要合并。两者相似:
time
表示状态变化name
和instance
复合标识,用于唯一标识一个事物的记录条目。- 最后一个单一的值命名的状态已经改变了,在那个时候的东西。
所以我想要组合的每个数据集的一个示例记录将是:
dict(time=0, name="a", instance=0, state=1)
dict(time=5, name="a", instance=0, location="london")
我想将这两个记录集合并为一个,其中每个(name, instance)
在每次都有最后已知的state
和location
。
[
dict(time=0, name="a", instance=0, state=1, location=np.nan),
dict(time=5, name="a", instance=0, state=1, location="london"),
]
为了达到这一点,我目前做pd.DataFrame.pivot_table
,pd.DataFrame.ffill
,pd.DataFrame.melt
和pd.DataFrame.reset_index
的组合。它似乎按预期工作,但感觉非常麻烦/不可读,特别是一旦我进入使用pd.DataFrame.melt
。
我觉得我错过了pd.DataFrame.melt
函数的一些用法,但我不太确定如何将文档应用于我正在使用的pd.MultiIndex
列的数据集,或者如果我错过了其他一些pandas
实用程序,我应该使用
如果我发现melt
不是我应该使用的,我会用更合适的东西更新问题标题。
我有:
import pandas as pd
states = [
dict(time=0, name="a", instance=0, state=0),
dict(time=0, name="a", instance=1, state=0),
dict(time=0, name="a", instance=2, state=0),
dict(time=0, name="b", instance=1, state=0),
dict(time=0, name="b", instance=2, state=0),
dict(time=1, name="a", instance=1, state=1),
dict(time=2, name="a", instance=2, state=1),
dict(time=2, name="b", instance=1, state=1),
]
locations = [
dict(time=0, name="a", instance=0, location="tokyo"),
dict(time=0, name="a", instance=1, location="tokyo"),
dict(time=0, name="a", instance=2, location="tokyo"),
dict(time=0, name="b", instance=1, location="tokyo"),
dict(time=0, name="b", instance=2, location="tokyo"),
dict(time=1, name="a", instance=0, location="london"),
dict(time=1, name="a", instance=2, location="london"),
dict(time=1, name="b", instance=1, location="london"),
dict(time=1, name="b", instance=2, location="london"),
dict(time=1, name="a", instance=1, location="paris"),
dict(time=2, name="a", instance=2, location="paris"),
dict(time=2, name="b", instance=1, location="paris"),
]
states = pd.DataFrame.from_dict(states)
locations = pd.DataFrame.from_dict(locations)
combined = pd.concat([states, locations], axis="index")
combined = combined.pivot_table(
index="time",
columns=["name", "instance"],
values=["state", "location"],
aggfunc="last",
)
combined = combined.ffill()
ugly_melt = combined.melt(ignore_index=False)
ugly_melt = ugly_melt.rename(columns={None: "state_status"})
ugly_melt = (
ugly_melt.reset_index()
.pivot(
index=["time", "name", "instance"],
columns=["state_status"],
values="value",
)
.reset_index()
)
print(ugly_melt)
您可以简单地对两个数据集进行连接和过滤,记住time
约束(只能连接以前/当前状态,而不能连接未来状态)。
这样就不需要执行前向填充ffill()
,并且使用连接和过滤器比pd.melt
方法更容易理解。
初始化代码
states = [
dict(time=0, name="a", instance=0, state=0),
dict(time=0, name="a", instance=1, state=0),
dict(time=0, name="a", instance=2, state=0),
dict(time=0, name="b", instance=1, state=0),
dict(time=0, name="b", instance=2, state=0),
dict(time=1, name="a", instance=1, state=1),
dict(time=2, name="a", instance=2, state=1),
dict(time=2, name="b", instance=1, state=1),
]
locations = [
dict(time=0, name="a", instance=0, location="tokyo"),
dict(time=0, name="a", instance=1, location="tokyo"),
dict(time=0, name="a", instance=2, location="tokyo"),
dict(time=0, name="b", instance=1, location="tokyo"),
dict(time=0, name="b", instance=2, location="tokyo"),
dict(time=1, name="a", instance=0, location="london"),
dict(time=1, name="a", instance=2, location="london"),
dict(time=1, name="b", instance=1, location="london"),
dict(time=1, name="b", instance=2, location="london"),
dict(time=1, name="a", instance=1, location="paris"),
dict(time=2, name="a", instance=2, location="paris"),
dict(time=2, name="b", instance=1, location="paris"),
]
实施
import pandas as pd
"""
Steps:
1. Convert to dataframe (Rename state time as state_time, keep location time as time)
2. Merge both dataframe together
3. Filter state time <= location time (since location uses current/previous state)
4. Filter for latest state time (since location must remember the latest state and not all previous states)
"""
# Step 1
states = pd.DataFrame(states).rename(columns={"time": "state_time"})
locations = pd.DataFrame(locations)
# Step 2
merged_df = pd.merge(locations, states, on=["name", "instance"])
# Step 3
merged_df = merged_df[merged_df["state_time"] <= merged_df["time"]]
# Step 4
merged_df = merged_df
.sort_values(["time", "name", "instance", "state_time"])
.drop_duplicates(["time", "name", "instance"], keep="last")
.reset_index(drop=True)
.drop(columns=["state_time"])
结果如下merged_df
time name instance location state
0 0 a 0 tokyo 0
1 0 a 1 tokyo 0
2 0 a 2 tokyo 0
3 0 b 1 tokyo 0
4 0 b 2 tokyo 0
5 1 a 0 london 0
6 1 a 1 paris 1
7 1 a 2 london 0
8 1 b 1 london 0
9 1 b 2 london 0
10 2 a 2 paris 1
11 2 b 1 paris 1
结果的长度遵循location
数据,如果您希望每个名称-实例-位置都有一个时间,您可以事先进行外部连接。
注意combined
在ffill
之后是这样的。
location state
name a b a b
instance 0 1 2 1 2 0 1 2 1 2
time
0 tokyo tokyo tokyo tokyo tokyo 0.0 0.0 0.0 0.0 0.0
1 london paris london london london 0.0 1.0 0.0 0.0 0.0
2 london paris paris paris london 0.0 1.0 1.0 1.0 0.0
有2列,location
和state
。您可以分别melt
和merge
。
melt1 = pd.melt(combined["location"], value_name="location", ignore_index=False).reset_index()
melt2 = pd.melt(combined["state"], value_name="state", ignore_index=False).reset_index()
better_melt = melt1.merge(melt2, on=["time", "name", "instance"])
better_melt
time name instance location state
0 0 a 0 tokyo 0.0
1 1 a 0 london 0.0
2 2 a 0 london 0.0
3 0 a 1 tokyo 0.0
4 1 a 1 paris 1.0
5 2 a 1 paris 1.0
6 0 a 2 tokyo 0.0
7 1 a 2 london 0.0
8 2 a 2 paris 1.0
9 0 b 1 tokyo 0.0
10 1 b 1 london 0.0
11 2 b 1 paris 1.0
12 0 b 2 tokyo 0.0
13 1 b 2 london 0.0
14 2 b 2 london 0.0
数据集:
import pandas as pd
states = [
dict(time=0, name="a", instance=0, state=0),
dict(time=0, name="a", instance=1, state=0),
dict(time=0, name="a", instance=2, state=0),
dict(time=0, name="b", instance=1, state=0),
dict(time=0, name="b", instance=2, state=0),
dict(time=1, name="a", instance=1, state=1),
dict(time=2, name="a", instance=2, state=1),
dict(time=2, name="b", instance=1, state=1),
]
locations = [
dict(time=0, name="a", instance=0, location="tokyo"),
dict(time=0, name="a", instance=1, location="tokyo"),
dict(time=0, name="a", instance=2, location="tokyo"),
dict(time=0, name="b", instance=1, location="tokyo"),
dict(time=0, name="b", instance=2, location="tokyo"),
dict(time=1, name="a", instance=0, location="london"),
dict(time=1, name="a", instance=2, location="london"),
dict(time=1, name="b", instance=1, location="london"),
dict(time=1, name="b", instance=2, location="london"),
dict(time=1, name="a", instance=1, location="paris"),
dict(time=2, name="a", instance=2, location="paris"),
dict(time=2, name="b", instance=1, location="paris"),
]
似乎时间,名称和实例是你用来索引数据的级别。使用set_index
将它们添加到索引中是有意义的:
states = (
pd.DataFrame.from_dict(states)
.set_index(["time", "name", "instance"])
)
locations = (
pd.DataFrame.from_dict(locations)
.set_index(["time", "name", "instance"])
)
一旦MultiIndex就位,您就可以沿着列连接状态和位置。这将在状态列中留下一些nan。首先对索引进行排序,以便首先按时间排列条目,然后根据相同的名称和实例对数据进行分组,最后执行前向填充。
combined = (
pd.concat([states, locations], axis=1)
.sort_index()
.groupby(["name", "instance"])
.ffill()
)
结果与你的略有不同。我得到:
location state
time name instance
0 a 0 tokyo 0.0
1 tokyo 0.0
2 tokyo 0.0
b 1 tokyo 0.0
2 tokyo 0.0
1 a 0 london 0.0
1 paris 1.0
2 london 0.0
b 1 london 0.0
2 london 0.0
2 a 2 paris 1.0
b 1 paris 1.0
结果:
state_status time name instance location state
0 0 a 0 tokyo 0.0
1 0 a 1 tokyo 0.0
2 0 a 2 tokyo 0.0
3 0 b 1 tokyo 0.0
4 0 b 2 tokyo 0.0
5 1 a 0 london 0.0
6 1 a 1 paris 1.0
7 1 a 2 london 0.0
8 1 b 1 london 0.0
9 1 b 2 london 0.0
10 2 a 0 london 0.0
11 2 a 1 paris 1.0
12 2 a 2 paris 1.0
13 2 b 1 paris 1.0
14 2 b 2 london 0.0
首先,我得到一个MultiIndex数据框架。如果你不喜欢,只要reset_index()
。最重要的是,您获得了我没有的额外行,例如当(time, name, instance) = (2, a, 0)时。所有输入数据都没有这种值组合,所以这就是为什么它没有出现在我的结果中。这是你的,因为pivot_table
是如何工作的。这可能是一种可取的行为,也可能不是,由你来决定。