pd.cut with datetime IntervalIndex as bins



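In the code below, I expect the timestamps to be binned into the intervals supplied through an IntervalIndex. Unfortunately, I only get NaN back. What is going wrong?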

import pandas as pd

# Test data
ts = [pd.Timestamp('2022/03/01 09:00'),
      pd.Timestamp('2022/03/01 10:00'),
      pd.Timestamp('2022/03/01 10:30'),
      pd.Timestamp('2022/03/01 15:00')]
df = pd.DataFrame({'a': range(len(ts)), 'ts': ts})

# Test
bins = pd.interval_range(pd.Timestamp('2022/03/01 08:00'),
                         pd.Timestamp('2022/03/01 16:00'),
                         freq='2H',
                         closed="left")
row_labels = pd.cut(df["ts"], bins)

I expected the result to be:

[2022-03-01 08:00:00, 2022-03-01 10:00:00)
[2022-03-01 10:00:00, 2022-03-01 12:00:00)
[2022-03-01 10:00:00, 2022-03-01 12:00:00)
[2022-03-01 14:00:00, 2022-03-01 16:00:00)

But I only get NaN:

row_labels
Out[37]: 
0    NaN
1    NaN
2    NaN
3    NaN
Name: ts, dtype: category
Categories (4, interval[datetime64[ns], left]): [ <
[2022-03-01 08:00:00, 2022-03-01 10:00:00) <
[2022-03-01 10:00:00, 2022-03-01 12:00:00) <
[2022-03-01 12:00:00, 2022-03-01 14:00:00) <
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]

What is going wrong here? Thanks for your help. Best,

Very interesting.

pd.cut(df['ts'].to_list(), bins)

produces the expected result:

[[2022-03-01 08:00:00, 2022-03-01 10:00:00), 
[2022-03-01 10:00:00, 2022-03-01 12:00:00), 
[2022-03-01 10:00:00, 2022-03-01 12:00:00), 
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]
Categories (4, interval[datetime64[ns], left]): [
[2022-03-01 08:00:00, 2022-03-01 10:00:00) < 
[2022-03-01 10:00:00, 2022-03-01 12:00:00) < 
[2022-03-01 12:00:00, 2022-03-01 14:00:00) < 
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]

but

pd.cut(df['ts'].to_numpy(), bins)
[NaN, NaN, NaN, NaN]
Categories (4, interval[datetime64[ns], left]): [
[2022-03-01 08:00:00, 2022-03-01 10:00:00) < 
[2022-03-01 10:00:00, 2022-03-01 12:00:00) < 
[2022-03-01 12:00:00, 2022-03-01 14:00:00) < 
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]

What?

Why does it work with a list, but not with an np.ndarray or a pd.Series?
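This does not explain the root cause, but one possible way to sidestep it for IntervalIndex bins is to ask the IntervalIndex directly which bin each timestamp falls into, using `get_indexer`, and then build the categorical from those codes with `pd.Categorical.from_codes`. A minimal sketch, assuming the same test data as above; the names `stamps` and `positions`, the cast to the bins' dtype, and the use of a `Timedelta` instead of the `'2H'` alias are my own choices, and I have not checked this on every pandas version:

```python
import pandas as pd

ts = [pd.Timestamp("2022-03-01 09:00"),
      pd.Timestamp("2022-03-01 10:00"),
      pd.Timestamp("2022-03-01 10:30"),
      pd.Timestamp("2022-03-01 15:00")]
df = pd.DataFrame({"a": range(len(ts)), "ts": ts})

# Same bins as in the question; a Timedelta replaces the '2H' alias (assumption:
# this avoids differences in frequency aliases between pandas versions).
bins = pd.interval_range(pd.Timestamp("2022-03-01 08:00"),
                         pd.Timestamp("2022-03-01 16:00"),
                         freq=pd.Timedelta(hours=2),
                         closed="left")

# Cast the timestamps to the bins' own datetime dtype so both sides share
# one resolution (a precaution on my part, not something pd.cut requires).
stamps = pd.DatetimeIndex(df["ts"]).astype(bins.dtype.subtype)

# Position of the bin containing each timestamp; -1 would mean "in no bin".
positions = bins.get_indexer(stamps)

# Build an ordered categorical like the one pd.cut returns;
# from_codes treats a code of -1 as missing.
row_labels = pd.Series(
    pd.Categorical.from_codes(positions, categories=bins, ordered=True),
    index=df.index,
    name="ts",
)
print(row_labels)
```

Because the timestamps are handed to the IntervalIndex as datetimes, this path does not depend on whether the original container was a list, an ndarray, or a Series.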


Additionally:

bins_dt_index = pd.date_range(pd.Timestamp('2022/03/01 08:00'),
                              pd.Timestamp('2022/03/01 16:00'),
                              freq='2H')
bins_dt_index
DatetimeIndex(['2022-03-01 08:00:00', '2022-03-01 10:00:00',
'2022-03-01 12:00:00', '2022-03-01 14:00:00',
'2022-03-01 16:00:00'],
dtype='datetime64[ns]', freq='2H')
pd.cut(df['ts'].to_list(), bins_dt_index, right=False)

raises

TypeError: '<' not supported between instances of 'int' and 'Timestamp'

whereas

pd.cut(df['ts'], bins_dt_index, right=False)

produces the expected result!

0    [2022-03-01 08:00:00, 2022-03-01 10:00:00)
1    [2022-03-01 10:00:00, 2022-03-01 12:00:00)
2    [2022-03-01 10:00:00, 2022-03-01 12:00:00)
3    [2022-03-01 14:00:00, 2022-03-01 16:00:00)
Name: ts, dtype: category
Categories (4, interval[datetime64[ns], left]): [
[2022-03-01 08:00:00, 2022-03-01 10:00:00) < 
[2022-03-01 10:00:00, 2022-03-01 12:00:00) < 
[2022-03-01 12:00:00, 2022-03-01 14:00:00) < 
[2022-03-01 14:00:00, 2022-03-01 16:00:00)]

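So a DatetimeIndex works with an np.ndarray or a pd.Series, but not with a list!

With an IntervalIndex, it is the other way around!

Shouldn't they all behave the same way? I mean, the pd.cut documentation clearly states that x can be any 1-dimensional array-like.

It would be great if someone could explain why this happens.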
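Until someone explains it, the DatetimeIndex results above suggest a defensive habit: wrap whatever you have in a pd.Series before calling pd.cut with DatetimeIndex bin edges, since that is the combination that produced the expected labels. A minimal sketch under that assumption; `ts_list` and `edges` are illustrative names of mine, and a `Timedelta` stands in for the `'2H'` alias:

```python
import pandas as pd

ts_list = [pd.Timestamp("2022-03-01 09:00"),
           pd.Timestamp("2022-03-01 10:00"),
           pd.Timestamp("2022-03-01 10:30"),
           pd.Timestamp("2022-03-01 15:00")]

# Bin edges every two hours from 08:00 to 16:00.
edges = pd.date_range("2022-03-01 08:00", "2022-03-01 16:00",
                      freq=pd.Timedelta(hours=2))

# Wrapping the list in a Series gives pd.cut a datetime64 dtype to work with,
# instead of a plain Python list of Timestamp objects.
labels = pd.cut(pd.Series(ts_list), edges, right=False)
print(labels)
```

This only covers the DatetimeIndex case; with IntervalIndex bins the behaviour reported above was the opposite, so the two kinds of bins may need different handling depending on the pandas version.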