Pandas: group by 2 columns and use another column to find the delta



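I have a pandas DataFrame with 4,909,144 rows, with time as the index and the columns source_name, dest_address, and tvalue, where tvalue holds the same values as the time index. I sort the df by source_name, dest_address, and tvalue as follows, so that the rows are grouped and in time order within each group: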

df = df.sort_values(by=['sourcehostname','destinationaddress','tvalue'])

This gives me:

                        source_name  dest_address   tvalue
time                
2019-02-06 15:00:54.000 source_1     72.21.215.90   2019-02-06 15:00:54.000 
2019-02-06 15:01:00.000 source_1     72.21.215.90   2019-02-06 15:01:00.000 
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000 
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000 
2019-02-06 15:00:54.000 source_1     131.107.0.89   2019-02-06 15:00:54.000 
2019-02-06 15:01:14.000 source_1     131.107.0.89   2019-02-06 15:01:14.000 
2019-02-06 15:03:02.000 source_2     69.63.191.1    2019-02-06 15:03:02.000 
2019-02-06 15:08:02.000 source_2     69.63.191.1    2019-02-06 15:08:02.000 

I want the difference between consecutive times, so I use:

#Create delta
df['delta'] = (df['tvalue']-df['tvalue'].shift()).fillna(0)

This gives me:

                        source_name  dest_address   tvalue                  delta
time                
2019-02-06 15:00:54.000 source_1     72.21.215.90   2019-02-06 15:00:54.000 00:00:00
2019-02-06 15:01:00.000 source_1     72.21.215.90   2019-02-06 15:01:00.000 00:00:06
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000 00:29:51
2019-02-06 15:30:51.000 source_1     72.21.215.90   2019-02-06 15:30:51.000 00:00:00
2019-02-06 15:00:54.000 source_1     131.107.0.89   2019-02-06 15:00:54.000 -1 days +23:30:03
2019-02-06 15:01:14.000 source_1     131.107.0.89   2019-02-06 15:01:14.000 00:00:20
2019-02-06 15:03:02.000 source_2     69.63.191.1    2019-02-06 15:03:02.000 00:01:48
2019-02-06 15:08:02.000 source_2     69.63.191.1    2019-02-06 15:08:02.000 00:05:00

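But I want to group by source_name and dest_address and take the difference in tvalue within each group, so that I don't get a delta like -1 days +23:30:03 at a group boundary, or a delta like 00:01:48 on the first source_2 entry, where it should be 00:00:00.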
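To make the source of the negative delta concrete, here is a small worked example using two timestamps copied from the table above. The variable names are only for illustration.

```python
import pandas as pd

# Last tvalue of the (source_1, 72.21.215.90) block and first tvalue of the
# next block, copied from the table above.
prev_row = pd.Timestamp("2019-02-06 15:30:51")
next_row = pd.Timestamp("2019-02-06 15:00:54")

# shift() pairs each row with the row directly above it, regardless of group,
# so this subtraction crosses the group boundary and goes negative.
cross_group_delta = next_row - prev_row
print(cross_group_delta)  # -1 days +23:30:03
```

pandas prints a negative Timedelta as a negative day count plus a positive time of day, which is why 29 minutes 57 seconds backwards shows up as -1 days +23:30:03.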

I am trying:

df.groupby(['sourcehostname','destinationaddress'])['tvalue'].diff().fillna(0)

But this takes a very long time, and it may not give me the result I am looking for.
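For what it's worth, the groupby/diff attempt can be checked on the eight sample rows from the question. The sketch below is only that, a check on a tiny frame: it assumes the columns are really named sourcehostname and destinationaddress as in the code (the printed frame shows source_name and dest_address), and it fills with pd.Timedelta(0) instead of 0 so the column stays a timedelta. It says nothing about speed on 4.9 million rows.

```python
import pandas as pd

# Sample rows copied from the question. Column names follow the question's
# code (an assumption, since the printed frame uses source_name/dest_address).
times = pd.to_datetime([
    "2019-02-06 15:00:54", "2019-02-06 15:01:00",
    "2019-02-06 15:30:51", "2019-02-06 15:30:51",
    "2019-02-06 15:00:54", "2019-02-06 15:01:14",
    "2019-02-06 15:03:02", "2019-02-06 15:08:02",
])
df = pd.DataFrame(
    {
        "sourcehostname": ["source_1"] * 6 + ["source_2"] * 2,
        "destinationaddress": ["72.21.215.90"] * 4
        + ["131.107.0.89"] * 2
        + ["69.63.191.1"] * 2,
        "tvalue": times,
    },
    index=pd.Index(times, name="time"),
)

# Within-group time order is what makes diff() meaningful.
df = df.sort_values(by=["sourcehostname", "destinationaddress", "tvalue"])

# diff() runs inside each (source, destination) group, so the first row of
# every group is NaT; fill it with a zero timedelta.
df["delta"] = (
    df.groupby(["sourcehostname", "destinationaddress"], sort=False)["tvalue"]
    .diff()
    .fillna(pd.Timedelta(0))
)
print(df)
```

On this sample no delta is negative and every group starts at 00:00:00. Note that the string sort puts 131.107.0.89 ahead of 72.21.215.90, so the row order differs from the table in the question.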

The following does not work, but is there a way to do something like my original code with a group-by added?

#Create delta
df['delta'] = df.groupby(['sourcehostname','destinationaddress'])(df['tvalue']-df['tvalue'].shift()).fillna(0)

# Alternative without groupby: flag the rows where a group starts, then zero the delta there.
# df must already be sorted by sourcehostname, destinationaddress, tvalue.
import pandas as pd
source_changed = df['sourcehostname'] != df['sourcehostname'].shift()
dest_changed = df['destinationaddress'] != df['destinationaddress'].shift()
change_occurred = (source_changed | dest_changed)
time_diff = df['tvalue'].diff()
zero_delta = pd.Timedelta(0)
df['time_diff'] = time_diff
df['change_occurred'] = change_occurred
# Where df['change_occurred'] is True, set df['delta'] to zero_delta;
# otherwise keep the value from df['time_diff'].
df['delta'] = df['time_diff'].where(~df['change_occurred'], zero_delta)
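The change-flag idea above can be tried end to end on the sample rows from the question. This is a minimal sketch under a few assumptions: the helper name delta_by_change_mask is made up for illustration, the column names are taken from the question's code, and the frame is assumed to be sorted already so that each group is contiguous.

```python
import pandas as pd

def delta_by_change_mask(df, keys, time_col):
    """Hypothetical helper: per-group deltas on a frame whose groups are contiguous."""
    # A row starts a new group when any key differs from the row above it
    # (the very first row compares against a missing value, so it is flagged too).
    changed = (df[keys] != df[keys].shift()).any(axis=1)
    # Plain row-to-row difference; only the first row of each group is wrong.
    diff = df[time_col].diff()
    # Overwrite those first rows with a zero timedelta.
    return diff.mask(changed, pd.Timedelta(0))

# Sample rows in the same order as the table in the question.
times = pd.to_datetime([
    "2019-02-06 15:00:54", "2019-02-06 15:01:00",
    "2019-02-06 15:30:51", "2019-02-06 15:30:51",
    "2019-02-06 15:00:54", "2019-02-06 15:01:14",
    "2019-02-06 15:03:02", "2019-02-06 15:08:02",
])
sample = pd.DataFrame(
    {
        "sourcehostname": ["source_1"] * 6 + ["source_2"] * 2,
        "destinationaddress": ["72.21.215.90"] * 4
        + ["131.107.0.89"] * 2
        + ["69.63.191.1"] * 2,
        "tvalue": times,
    },
    index=pd.Index(times, name="time"),
)

keys = ["sourcehostname", "destinationaddress"]
sample["delta"] = delta_by_change_mask(sample, keys, "tvalue").to_numpy()
print(sample)
```

Because it uses only shift, diff, and a mask over whole columns, this avoids the group-by step entirely. Whether it is actually faster on the full data would need to be measured.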
