Pandas: find the ids whose time index starts in a specific month



I have some data in CSV format with the columns id, time, var. From it I built a multi-indexed DataFrame, roughly of the following form:

import numpy as np
import pandas as pd

def time(t):
    # Offset of t days from noon on 2019-01-01
    return pd.Timestamp("2019-01-01T12") + pd.to_timedelta(t, "d")

# Two index levels: id and time (note that id 2 lists February before January)
arrays = [
    np.array([1, 1, 2, 2, 3, 3]),
    np.array([time(0), time(1), time(396), time(365), time(31), time(365)]),
]
df = pd.DataFrame(np.random.randn(6, 1), index=arrays, columns=["var"])
df.index.names = ["id", "time"]
df
                             var
id time
1  2019-01-01 12:00:00 -0.505903
   2019-01-02 12:00:00  0.626197
2  2020-02-01 12:00:00  0.461155
   2020-01-01 12:00:00  0.569891
3  2019-02-01 12:00:00 -1.079466
   2020-01-01 12:00:00  0.721466

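Given this, I want to find all ids whose earliest entry date is in January, and then plot the trajectories, one per id, only for those that start in January.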

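Note that I believe the times are actually sorted in my real data, while the ids are not. I am not sure whether that changes anything.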

df.pseudo_filter(start_month="January")
                             var
id time
1  2019-01-01 12:00:00 -0.505903
   2019-01-02 12:00:00  0.626197
2  2020-02-01 12:00:00  0.461155
   2020-01-01 12:00:00  0.569891

You can use groupby.filter on the month of each id's minimum time, comparing either the month number or the month name:

df.groupby(level=0).filter(lambda x: x.index.get_level_values(1).min().month == 1)

df.groupby(level=0).filter(lambda x: x.index.get_level_values(1).min().month_name() == 'January')

Output:

                             var
id time
1  2019-01-01 12:00:00  0.410113
   2019-01-02 12:00:00 -0.572882
2  2020-02-01 12:00:00 -0.801334
   2020-01-01 12:00:00  1.312035
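
Because the check uses min() rather than the first row of each group, it does not depend on the times being sorted within an id (id 2 in the sample data lists February before January). As a possible alternative to the per-group lambda, the same selection can be sketched with a boolean mask. This is only a sketch: it assumes the index levels are named "id" and "time" as in the question, it uses fixed values instead of random ones so the result is reproducible, and the names first_time, january_ids and result are illustrative.

```python
import numpy as np
import pandas as pd


def time(t):
    # Offset of t days from noon on 2019-01-01
    return pd.Timestamp("2019-01-01T12") + pd.Timedelta(days=t)


index = pd.MultiIndex.from_arrays(
    [
        [1, 1, 2, 2, 3, 3],
        [time(0), time(1), time(396), time(365), time(31), time(365)],
    ],
    names=["id", "time"],
)
# Deterministic stand-in for the random "var" column of the question
df = pd.DataFrame({"var": np.arange(6.0)}, index=index)

# Earliest timestamp per id, computed from the index levels alone
first_time = df.index.to_frame(index=False).groupby("id")["time"].min()

# Ids whose earliest timestamp falls in January
january_ids = first_time[first_time.dt.month == 1].index

# Keep every row that belongs to one of those ids
result = df[df.index.get_level_values("id").isin(january_ids)]
print(result)
```

With the sample index this keeps ids 1 and 2 and drops id 3, whose first entry is in February.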

Adding the filter to the DataFrame as a custom accessor

@pd.api.extensions.register_dataframe_accessor("pseudo")
class Pseudo:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    def filter(self, start_month):
        return (self._obj.groupby(level=0)
                .filter(lambda x: x.index.get_level_values(1).min()
                        .month_name() == start_month))

Then you can use:

df.pseudo.filter(start_month='January')

Output:

                             var
id time
1  2019-01-01 12:00:00 -1.314898
   2019-01-02 12:00:00  0.810314
2  2020-02-01 12:00:00 -1.214327
   2020-01-01 12:00:00 -0.678823
