I have some data in CSV format with the columns id, time, var.
I then went on to create a MultiIndex DataFrame, roughly of the following form:
import numpy as np
import pandas as pd
def time(t):
    return pd.Timestamp("2019-01-01T12") + pd.to_timedelta(t, "d")

arrays = [
    np.array([1, 1, 2, 2, 3, 3]),
    np.array([time(0), time(1), time(396), time(365), time(31), time(365)]),
]
df = pd.DataFrame(np.random.randn(6, 1), index=arrays, columns=["var"])
df.index.names = ["id", "time"]
df
                             var
id time
1  2019-01-01 12:00:00 -0.505903
   2019-01-02 12:00:00  0.626197
2  2020-02-01 12:00:00  0.461155
   2020-01-01 12:00:00  0.569891
3  2019-02-01 12:00:00 -1.079466
   2020-01-01 12:00:00  0.721466
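Given this, I would like to find all ids whose earliest entry date falls in January, and then plot the trajectories, one per id, only for those that start in January.
Note: I believe the times are actually sorted, while the ids are not. I am not sure whether that changes anything.
I.e.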
df.pseudo_filter(start_month="January")
                             var
id time
1  2019-01-01 12:00:00 -0.505903
   2019-01-02 12:00:00  0.626197
2  2020-02-01 12:00:00  0.461155
   2020-01-01 12:00:00  0.569891
You can use groupby.filter on the month of each id's minimum time:
df.groupby(level=0).filter(lambda x: x.index.get_level_values(1).min().month == 1)
or
df.groupby(level=0).filter(lambda x: x.index.get_level_values(1).min().month_name() == 'January')
Output:
                             var
id time
1  2019-01-01 12:00:00  0.410113
   2019-01-02 12:00:00 -0.572882
2  2020-02-01 12:00:00 -0.801334
   2020-01-01 12:00:00  1.312035
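For larger frames, a lambda-free variant may be worth trying. The following is only a sketch under the assumption that the index levels are named "id" and "time" as in the question; the names first_time, january_ids and result are illustrative and not part of the original answer. It computes the earliest time per id once, then keeps the rows whose id is in the selected set.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (random values).
ids = [1, 1, 2, 2, 3, 3]
times = pd.Timestamp("2019-01-01T12") + pd.to_timedelta([0, 1, 396, 365, 31, 365], "D")
index = pd.MultiIndex.from_arrays([ids, times], names=["id", "time"])
df = pd.DataFrame({"var": np.random.randn(6)}, index=index)

# Earliest timestamp per id, as a Series indexed by id.
first_time = df.reset_index("time").groupby(level="id")["time"].min()

# Ids whose earliest timestamp falls in January (month number 1).
january_ids = first_time[first_time.dt.month == 1].index

# Keep every row belonging to one of those ids.
result = df[df.index.get_level_values("id").isin(january_ids)]
print(result)
```

This should select the same rows as the groupby.filter calls above, since both test the month of each id's minimum time.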
Adding the filter to the DataFrame as a new feature
@pd.api.extensions.register_dataframe_accessor("pseudo")
class Pseudo:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def filter(self, start_month):
        # Keep only the ids whose earliest time falls in start_month.
        return (self._obj.groupby(level=0)
                .filter(lambda x: x.index.get_level_values(1).min()
                        .month_name() == start_month))
Then you can use
df.pseudo.filter(start_month='January')
Output:
                             var
id time
1  2019-01-01 12:00:00 -1.314898
   2019-01-02 12:00:00  0.810314
2  2020-02-01 12:00:00 -1.214327
   2020-01-01 12:00:00 -0.678823
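The question also asks about plotting one trajectory per id. One possible way to prepare the filtered result for that, offered as a sketch rather than part of the original answer, is to unstack the id level so that each id becomes its own column. The names filtered and wide are illustrative, and the level names "id" and "time" are assumed to match the question.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (random values).
ids = [1, 1, 2, 2, 3, 3]
times = pd.Timestamp("2019-01-01T12") + pd.to_timedelta([0, 1, 396, 365, 31, 365], "D")
index = pd.MultiIndex.from_arrays([ids, times], names=["id", "time"])
df = pd.DataFrame({"var": np.random.randn(6)}, index=index)

# Same filter as in the answer: keep ids whose earliest time is in January.
filtered = df.groupby(level="id").filter(
    lambda g: g.index.get_level_values("time").min().month == 1
)

# Pivot to wide form: one column per id, rows ordered by time.
# Timestamps that an id does not have become NaN in its column.
wide = filtered["var"].unstack("id").sort_index()
print(wide)
```

With matplotlib installed, wide.plot(marker="o") would then draw one line per id. Where ids have no timestamps in common, the NaN entries can leave gaps in the lines, so looping over filtered.groupby(level="id") and plotting each group separately may be the better choice.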