Merging events that are temporally close to one another in numpy (pandas)



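I want to simplify a list of start and stop times. When the time between the end of one event and the start of the next is within an allowable gap, I want to merge the two events (rows). Below is a simplified version of my data, along with what I want as output: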

import numpy as np
import pandas as pd
start_time = [  1,  7, 20, 22, 27, 35]
stop_time  = [  5,  9, 22, 26, 30, 40]
events = pd.DataFrame({'start_time': start_time, 'stop_time': stop_time})
allowable_gap = 2.0
desired_start_time = [  1, 20, 35]
desired_stop_time  = [  9, 30, 40]
desired_events = pd.DataFrame({'start_time':desired_start_time, 'stop_time':desired_stop_time})

I have no requirement to use Pandas. However, I do need to use at least numpy. The number of events is on the order of 1e6.

Any implementation or guidance is appreciated. I know part of my problem is that I don't "get" pandas.

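My use case is probably irrelevant to the solution. As background, I am collecting a large number of events and then plotting them with matplotlib.pyplot. Because the output is complex, I have found that the best format is .svg. IE generally renders it well, but it takes a long time to do so, and I would like to reduce the number of lines it has to draw. I would love a better way to look at the time series, but that is beyond the scope of this question.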

A more efficient approach:

In [106]: (events.groupby((events.start_time - events.stop_time.shift() > allowable_gap).cumsum())
   .....:        .agg({'start_time':'min', 'stop_time':'max'})[['start_time','stop_time']])
Out[106]:
   start_time  stop_time
0           1          9
1          20         30
2          35         40

Timing against a 60K-row DF:

In [129]: events = pd.concat([events] * 10**4, ignore_index=True)
In [130]: events.shape
Out[130]: (60000, 2)
In [131]: %paste
def f():
    desired_start_time = []
    desired_stop_time  = []
    start = None
    end = None
    for index, row in events.iterrows():
        if start is None and end is None:
            start = row['start_time']
            end = row['stop_time']
        else:
            if end + allowable_gap >= row['start_time']:
                end = row['stop_time']
            else:
                desired_start_time.append(start)
                desired_stop_time.append(end)
                start = row['start_time']
                end = row['stop_time']
    desired_start_time.append(start)
    desired_stop_time.append(end)
## -- End pasted text --
In [132]: %timeit f()
1 loop, best of 3: 16.1 s per loop
In [133]: %%timeit
   .....: (events.groupby((events.start_time - events.stop_time.shift() > allowable_gap).cumsum())
   .....:        .agg({'start_time':'min', 'stop_time':'max'})[['start_time','stop_time']])
   .....:
100 loops, best of 3: 16.9 ms per loop

Conclusion: the "loop" solution is approximately 1000 times slower.

Another timing, against a 6M-row DF:

In [153]: events = pd.concat([events] * 10**6, ignore_index=True)
In [154]: events.shape
Out[154]: (6000000, 2)
In [155]: %%timeit
   .....: (events.groupby((events.start_time - events.stop_time.shift() > allowable_gap).cumsum())
   .....:        .agg({'start_time':'min', 'stop_time':'max'})[['start_time','stop_time']])
   .....:
1 loop, best of 3: 1.49 s per loop

The given and desired DFs:

In [98]: events
Out[98]:
   start_time  stop_time
0           1          5
1           7          9
2          20         22
3          22         26
4          27         30
5          35         40
In [99]: desired_events
Out[99]:
   start_time  stop_time
0           1          9
1          20         30
2          35         40

Explanation:

In [107]: events.start_time - events.stop_time.shift()
Out[107]:
0     NaN
1     2.0
2    11.0
3     0.0
4     1.0
5     5.0
dtype: float64
In [108]: (events.start_time - events.stop_time.shift() > allowable_gap)
Out[108]:
0    False
1    False
2     True
3    False
4    False
5     True
dtype: bool
In [109]: (events.start_time - events.stop_time.shift() > allowable_gap).cumsum()
Out[109]:
0    0
1    0
2    1
3    1
4    1
5    2
dtype: int32
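
Since the question says pandas is optional, the same three steps (gap to the previous event, "new group" flags, group boundaries) can also be sketched in plain numpy. This is only a sketch: the function name `merge_close_events` is hypothetical, and it assumes the events are sorted by start time and that no event is contained in an earlier one, so that the last event of each group carries the group's stop time.

```python
import numpy as np

def merge_close_events(start, stop, allowable_gap):
    """Merge consecutive events whose gap is <= allowable_gap.

    Assumes events are sorted by start time and that no event is
    contained in an earlier one (true for the question's sample data).
    """
    start = np.asarray(start)
    stop = np.asarray(stop)
    if start.size == 0:
        return start.copy(), stop.copy()
    # True where the gap to the previous event is too large to merge.
    new_group = start[1:] - stop[:-1] > allowable_gap
    # Index of the first event of each merged group (event 0 always opens one).
    first = np.concatenate(([0], np.flatnonzero(new_group) + 1))
    # Index of the last event of each group: the one just before the next group.
    last = np.concatenate((first[1:] - 1, [start.size - 1]))
    return start[first], stop[last]

merged_start, merged_stop = merge_close_events(
    [1, 7, 20, 22, 27, 35], [5, 9, 22, 26, 30, 40], allowable_gap=2.0)
print(merged_start)  # [ 1 20 35]
print(merged_stop)   # [ 9 30 40]
```

Unlike the groupby version, this takes the stop time of the last event in each group rather than the maximum, which is equivalent only under the assumptions stated above.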

This solution uses the DataFrame.iterrows() function.
I made the following assumption:

  • start_time <= stop_time for all events

import numpy as np
import pandas as pd
start_time = [  1,  7, 20, 22, 27, 35]
stop_time  = [  5,  9, 22, 26, 30, 40]
events = pd.DataFrame({'start_time': start_time, 'stop_time': stop_time})
allowable_gap = 2.0
desired_start_time = []
desired_stop_time  = []
start = None
end = None
for index, row in events.iterrows():
    if start is None and end is None:
        start = row['start_time']
        end = row['stop_time']
    else:
        if end + allowable_gap >= row['start_time']:
            end = row['stop_time']
        else:
            desired_start_time.append(start)
            desired_stop_time.append(end)
            start = row['start_time']
            end = row['stop_time']
desired_start_time.append(start)
desired_stop_time.append(end)
print(desired_start_time)
print(desired_stop_time)

Output:

[1,20,35]
[9,30,40]
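
Besides the stated assumption, the loop compares each event only with the stop time of the row before it, so it also relies on the rows being sorted by start time and on no short event being nested inside a longer earlier one. The following is a minimal sketch of one way to relax that, assuming the same column names as in the question; the helper name `merge_events_robust` is hypothetical. It sorts first and compares each start against the running maximum of the earlier stop times.

```python
import pandas as pd

def merge_events_robust(events, allowable_gap):
    """Merge events that overlap or lie within allowable_gap of each other.

    Sketch only: sorts by start_time, then compares each start against the
    latest stop time seen so far instead of the previous row's stop time.
    """
    ev = events.sort_values('start_time', kind='stable').reset_index(drop=True)
    # Latest stop time among all events up to and including each row.
    running_stop = ev['stop_time'].cummax()
    # A new group begins where the gap to everything seen so far is too large.
    group = (ev['start_time'] - running_stop.shift() > allowable_gap).cumsum()
    return (ev.groupby(group)
              .agg(start_time=('start_time', 'min'),
                   stop_time=('stop_time', 'max'))
              .reset_index(drop=True))

# Unsorted input where the event (2, 3) is nested inside the event (1, 10).
nested = pd.DataFrame({'start_time': [1, 2, 20, 8],
                       'stop_time':  [10, 3, 22, 9]})
print(merge_events_robust(nested, allowable_gap=2.0))
```

For this input the sketch yields two merged events, (1, 10) and (20, 22), whereas comparing only against the previous row would split off (8, 9) as a separate event even though it lies inside (1, 10).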
