Creating time slots for a list of timestamps using Pandas



>I have a column of pandas datetime64 values:

df['time']
0         2019-10-04 12:03:53+00:00
1         2019-10-04 11:21:23+00:00
2         2019-10-04 12:23:11+00:00
3         2019-10-04 18:04:52+00:00
4         2019-10-04 12:22:21+00:00
...           
2889974   2019-10-11 10:53:19+00:00
2889975   2019-10-11 10:58:38+00:00
2889976   2019-10-10 10:36:47+00:00
2889977   2019-10-10 10:36:47+00:00
2889978   2019-07-08 04:36:45+00:00
Name: time, Length: 2889979, dtype: datetime64[ns, UTC]

and a corresponding column of time-of-day strings, df['time_full']:

df['time_full']
0          12:03:53
1          11:21:23
2          12:23:11
3          18:04:52
4          12:22:21
...   
2889974    10:53:19
2889975    10:58:38
2889976    10:36:47
2889977    10:36:47
2889978    04:36:45
Name: time_full, Length: 2889979, dtype: object

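I want to divide the day into 30-minute slots (48 slots in total) and assign a slot to every value in the df['time'] column. In other words, I want to create a categorical variable from the timestamps. Something like this (the slot numbers are only illustrative):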

df['time']                             df['slot']
0         2019-10-04 12:03:53+00:00    4
1         2019-10-04 11:21:23+00:00    2
2         2019-10-04 12:23:11+00:00    32
3         2019-10-04 18:04:52+00:00    40
4         2019-10-04 12:22:21+00:00    5

I tried binning the values into slots with pandas' pd.cut() method and ended up with this:

pd.cut(df['time'].astype(np.int64)//10**9,
bins=pd.date_range("00:00", "23:59", freq="30min"))

But the output I get looks like this:

0          NaN
1          NaN
2          NaN
3          NaN
4          NaN
... 
2889974    NaN
2889975    NaN
2889976    NaN
2889977    NaN
2889978    NaN
Name: time, Length: 2889979, dtype: category
Categories (47, interval[int64]): [(1575331200000000000, 1575333000000000000] < (1575333000000000000, 1575334800000000000] < (1575334800000000000, 1575336600000000000] < (1575336600000000000, 1575338400000000000] ... (1575408600000000000, 1575410400000000000] < (1575410400000000000, 1575412200000000000] < (1575412200000000000, 1575414000000000000] < (1575414000000000000, 1575415800000000000]]

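I also tried using df['time_full'] as the bins, but that raised an error because it is a list of strings. I think the problem is that df['time'] is not suitable for binning because it holds both a date and a time, but I'm not sure. Any help would be appreciated.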

If you want the slots to range from 0 to 47, you can use:

df['slots'] = df['time'].apply(lambda x: x.hour*2 if x.minute <= 29 else x.hour*2+1)
df
time  slots
0       2019-10-04 12:03:53+00:00     24
1       2019-10-04 11:21:23+00:00     22
2       2019-10-04 12:23:11+00:00     24
3       2019-10-04 18:04:52+00:00     36
4       2019-10-04 12:22:21+00:00     24
2889974 2019-10-11 10:53:19+00:00     21
2889975 2019-10-11 10:58:38+00:00     21
2889976 2019-10-10 10:36:47+00:00     21
2889977 2019-10-10 10:36:47+00:00     21
2889978 2019-07-08 04:36:45+00:00      9
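
With almost 2.9 million rows, a row-by-row apply can be slow. As a minimal sketch, the same 0 to 47 numbering can be computed with the vectorized .dt accessor. The small DataFrame below is a made-up sample mirroring a few rows of the question, not the original data.

```python
import pandas as pd

# Hypothetical sample mirroring a few rows of the question's column.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-10-04 12:03:53+00:00",
        "2019-10-04 11:21:23+00:00",
        "2019-10-04 12:23:11+00:00",
        "2019-10-04 18:04:52+00:00",
        "2019-10-11 10:53:19+00:00",
    ], utc=True)
})

# Two slots per hour; minute // 30 is 0 for minutes 0-29 and 1 for 30-59.
df["slots"] = df["time"].dt.hour * 2 + df["time"].dt.minute // 30
print(df)
```

This should give the same values as the lambda above for any timestamp, since `x.minute <= 29` and `minute // 30 == 0` describe the same condition.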

Further testing, with one timestamp at the start of each half hour:

date  slots
0  2019-10-04 00:00:00      0
1  2019-10-04 00:30:00      1
2  2019-10-04 01:00:00      2
3  2019-10-04 01:30:00      3
4  2019-10-04 02:00:00      4
5  2019-10-04 02:30:00      5
6  2019-10-04 03:00:00      6
7  2019-10-04 03:30:00      7
8  2019-10-04 04:00:00      8
9  2019-10-04 04:30:00      9
10 2019-10-04 05:00:00     10
11 2019-10-04 05:30:00     11
12 2019-10-04 06:00:00     12
13 2019-10-04 06:30:00     13
14 2019-10-04 07:00:00     14
15 2019-10-04 07:30:00     15
16 2019-10-04 08:00:00     16
17 2019-10-04 08:30:00     17
18 2019-10-04 09:00:00     18
19 2019-10-04 09:30:00     19
20 2019-10-04 10:00:00     20
21 2019-10-04 10:30:00     21
22 2019-10-04 11:00:00     22
23 2019-10-04 11:30:00     23
24 2019-10-04 12:00:00     24
25 2019-10-04 12:30:00     25
26 2019-10-04 13:00:00     26
27 2019-10-04 13:30:00     27
28 2019-10-04 14:00:00     28
29 2019-10-04 14:30:00     29
30 2019-10-04 15:00:00     30
31 2019-10-04 15:30:00     31
32 2019-10-04 16:00:00     32
33 2019-10-04 16:30:00     33
34 2019-10-04 17:00:00     34
35 2019-10-04 17:30:00     35
36 2019-10-04 18:00:00     36
37 2019-10-04 18:30:00     37
38 2019-10-04 19:00:00     38
39 2019-10-04 19:30:00     39
40 2019-10-04 20:00:00     40
41 2019-10-04 20:30:00     41
42 2019-10-04 21:00:00     42
43 2019-10-04 21:30:00     43
44 2019-10-04 22:00:00     44
45 2019-10-04 22:30:00     45
46 2019-10-04 23:00:00     46
47 2019-10-04 23:30:00     47
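
A table like the one above can probably be reproduced along these lines. The frame name `test` and the use of pd.date_range are assumptions; the answer does not show how its test data was built.

```python
import pandas as pd

# One timestamp at the start of each half hour of a single day
# (the date is chosen to match the table above).
test = pd.DataFrame({"date": pd.date_range("2019-10-04", periods=48, freq="30min")})

# Same rule as in the answer: first half of the hour -> even slot, second half -> odd slot.
test["slots"] = test["date"].apply(lambda x: x.hour * 2 if x.minute <= 29 else x.hour * 2 + 1)

# Every slot from 0 to 47 should appear exactly once, in order.
assert test["slots"].tolist() == list(range(48))
print(test)
```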

If you want the slots to range from 1 to 48 instead:

df['slots'] = df['time'].apply(lambda x: x.hour*2+1 if x.minute <= 29 else x.hour*2+2)
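
The pd.cut attempt in the question appears to return NaN because the values being binned are seconds since the Unix epoch, while the bin edges are nanosecond timestamps for the current date, so no value falls inside any interval. If a true categorical column is wanted, one possible approach is to bin the seconds elapsed since midnight. The sketch below assumes made-up sample rows and a hypothetical column name `slot`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows, including both ends of the day.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2019-10-04 00:00:00+00:00",
        "2019-10-04 12:03:53+00:00",
        "2019-10-04 18:34:52+00:00",
        "2019-10-04 23:59:59+00:00",
    ], utc=True)
})

# Seconds elapsed since midnight, which drops the date part entirely.
seconds = (df["time"].dt.hour * 3600
           + df["time"].dt.minute * 60
           + df["time"].dt.second)

# 49 edges from 0 to 86400 in 1800-second steps define 48 half-hour bins.
edges = np.arange(0, 24 * 3600 + 1, 1800)

# right=False makes each bin [start, end), so 00:00:00 lands in slot 1
# and 00:30:00 lands in slot 2.
df["slot"] = pd.cut(seconds, bins=edges, right=False, labels=list(range(1, 49)))
print(df)
```

The result is an ordered categorical with the 48 labels 1 to 48, including slots that do not occur in the data.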

It depends on how you want to get the values. See "How to convert a DatetimeIndexResampler to a DataFrame?"

df = pd.DataFrame(pd.date_range('2000-01-02', freq='15min', periods=15), columns=['time'])
df.set_index(df['time'], inplace=True)
df=df.resample('30min').interpolate()
df
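
If the goal is to group rows into 30-minute bins along the actual timeline (date included) rather than to label the time of day, one possible reading of the resample suggestion is sketched below. The column name `bin_start` and the choice of `.size()` as the aggregation are assumptions made for illustration.

```python
import pandas as pd

# Same sample data as above: 15 timestamps, 15 minutes apart.
df = pd.DataFrame(pd.date_range("2000-01-02", freq="15min", periods=15), columns=["time"])

# Start of the 30-minute bin each timestamp falls into (the date is kept).
df["bin_start"] = df["time"].dt.floor("30min")

# Number of rows in each 30-minute bin.
counts = df.set_index("time").resample("30min").size()
print(counts)
```

Unlike the slot numbers from 0 to 47, these bins are distinct for each calendar day, so the same half hour on two different dates ends up in two different bins.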
