Set a pandas Series to a specific value only where it falls between two times



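I am trying to set the DataFrame column long to 0 for rows whose timestamps fall between a start time and an end time. Can someone help me understand why the first two approaches below do not work, while the last one does?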

start_time, end_time = '9:30', '9:40'
data.between_time(start_time, end_time)['long'] = 0  # does not work
data.loc[data.between_time(start_time, end_time).index]['long'] = 0  # does not work
data['long'].loc[data.between_time(start_time, end_time).index] = 0  # works

Also, if there is a faster way than option 3 above, please let me know.

This is more of a pedagogical question. In my ideal world, the first approach would work, since it seems the most concise.

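The first idea is to get the row positions with DatetimeIndex.indexer_between_time and set the values with DataFrame.iloc. Because iloc is purely positional, the position of column long is also needed, which Index.get_loc provides: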

idx = data.index.indexer_between_time(start_time, end_time)
data.iloc[idx, data.columns.get_loc('long')] = 0

An alternative closer to your solution uses DataFrame.loc:

df = data.between_time(start_time, end_time)
data.loc[df.index, 'long'] = 0
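
To check that the two approaches above select the same rows, here is a minimal sketch on a small frame. The sample data (20 rows at one-minute spacing) and the names by_position and by_label are assumptions made for illustration; they are not part of the question.

```python
import pandas as pd

# Hypothetical sample data: one row per minute, 09:25 through 09:44.
index = pd.date_range("2000-01-01 09:25", periods=20, freq="min")
data = pd.DataFrame({"long": range(1, 21)}, index=index)
start_time, end_time = "9:30", "9:40"

# Positional approach: integer row positions plus the column position.
by_position = data.copy()
rows = by_position.index.indexer_between_time(start_time, end_time)
by_position.iloc[rows, by_position.columns.get_loc("long")] = 0

# Label approach: index labels of the matching rows plus the column name.
by_label = data.copy()
labels = by_label.between_time(start_time, end_time).index
by_label.loc[labels, "long"] = 0

# Both frames should end up identical, with only the 09:30-09:40 rows zeroed.
print(by_position.equals(by_label))
print(int((by_label["long"] == 0).sum()))
```

Both endpoints are included by default, so the rows at exactly 09:30 and 09:40 are set to 0 as well.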

Performance is similar on sample data with 1M rows, but your solution should be avoided because it can trigger SettingWithCopyWarning:

i = pd.date_range('2000-01-01', freq='H', periods=1000000)
N = len(i)
data = pd.DataFrame({'long':range(N)}, index=i)
start_time, end_time = '9:30', '9:40'
In [287]: %timeit data['long'].loc[data.between_time(start_time, end_time).index] = 0
102 ms ± 4.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [289]: %timeit data.iloc[data.index.indexer_between_time(start_time, end_time), data.columns.get_loc('long')] = 0
96.8 ms ± 856 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [291]: %timeit data.loc[data.between_time(start_time, end_time).index, 'long'] = 0
97.5 ms ± 1.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
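
As for why the first two attempts in the question fail: between_time and loc with a list of labels each return a new DataFrame, so the chained ['long'] = 0 writes into that temporary object and never reaches data. The third attempt only works when data['long'] happens to share memory with data, which pandas does not guarantee and which Copy-on-Write mode is designed to prevent. The following is a rough sketch of that behaviour on hypothetical one-minute sample data, with warnings silenced only to keep the output clean.

```python
import warnings

import pandas as pd

# Hypothetical sample data: one row per minute, 09:25 through 09:44.
index = pd.date_range("2000-01-01 09:25", periods=20, freq="min")
data = pd.DataFrame({"long": range(1, 21)}, index=index)
start_time, end_time = "9:30", "9:40"

with warnings.catch_warnings():
    # Silence the chained-assignment warnings these two lines may emit.
    warnings.simplefilter("ignore")
    # Attempt 1: assigns into the new frame returned by between_time.
    data.between_time(start_time, end_time)["long"] = 0
    # Attempt 2: assigns into the new frame returned by loc.
    data.loc[data.between_time(start_time, end_time).index]["long"] = 0

# Number of zeroed rows in the original frame after both attempts.
zeros_after_chained = int((data["long"] == 0).sum())
print(zeros_after_chained)

# A single loc call with both row labels and column name writes into data itself.
data.loc[data.between_time(start_time, end_time).index, "long"] = 0
zeros_after_loc = int((data["long"] == 0).sum())
print(zeros_after_loc)
```

Selecting rows and column in one indexing call avoids the intermediate object entirely, which is why it is the safer pattern.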
