Python:识别列中的连续零,删除它们的行并开始新的编号



我有以下df,我想将数据拆分为行程。

In: df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])
out:    Dev_ID  Speed   Duration
0   1001    0.054013    0
1   1001    0.449237    12
2   1001    0.000000    1
3   1001    0.000000    1
4   1001    0.446766    2
5   1001    1.831082    1
6   1001    0.000000    1
7   1001    0.000000    11
8   1001    0.000000    1
9   1001    0.000000    20
10  1001    0.000000    1
11  1001    0.000000    54
12  1001    10.060403   2
13  1001    11.642113   0
14  1001    0.000000    1
15  1002    0.000000    2
16  1002    1.234634    23
17  1002    1.831082    1
18  1002    0.000000    1

拆分的标准是速度值0比120秒长。因此,我必须查找每个dev_ID,并以某种方式检查是否存在持续时间超过120秒的连续零。如果条件为真,我想删除这些行(其中零持续时间超过了120秒(,并在trip_ID列中启动一个新的ID。所以结果应该是这样的:

Dev_ID  Speed   Duration    Trip_ID
0   1001    0.054013    0   10
1   1001    0.449237    12  10
2   1001    0.000000    1   10
3   1001    0.000000    1   10
4   1001    0.446766    2   10
5   1001    1.831082    1   10
6   1001    10.060403   2   11
7   1001    11.642113   0   11
8   1001    0.000000    1   11
9   1002    0.000000    2   12
10  1002    1.234634    23  12
11  1002    1.831082    1   12
12  1002    0.000000    1   12

我不完全确定我是否理解这个条件,但我制作了一个通用代码,希望它能相似,并且您可以进行调整。

关键思想是:使用pd.shift((获取差异,使用np.where获取速度差异序列为0的索引列表,使用get_contigous_index将这些索引拆分为连续组,然后对于每个连续组,如果持续时间总和>120,则更改"Trip_id">

我假设你的持续时间是分钟,否则没有一个间隔会超过120

import pandas as pd
import numpy as np
from itertools import groupby
from operator import itemgetter
df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])
df['Duration'] = df['Duration']*60
df['Trip_ID'] = df['Dev_ID']
def get_contigous_index(indexes):
ranges = []
for k,g in groupby(enumerate(indexes),lambda x:x[0]-x[1]):
group = (map(itemgetter(1),g))
group = list(map(int,group))
ranges.append((group[0],group[-1]))
return ranges
for Dev_ID, data in df.groupby("Dev_ID"):
data['speed_diff'] = data['Speed'] - data['Speed'].shift(1)
diff_0 = np.where(data['speed_diff'] == 0)[0]
for contigousZeroes_range in get_contigous_index(diff_0):
fst_idx, lst_idx = list(contigousZeroes_range)
range_ = list(range(fst_idx,lst_idx+1))
subgroup = data.loc[range_ ,data.columns]
if not subgroup.empty:
if subgroup['Duration'].sum() > 120:
df.loc[range_,'Trip_ID'] = "a_different_id"
print(df)

这将打印这样的数据帧:

Dev_ID      Speed  Duration         Trip_ID
0     1001   0.054013         0            1001
1     1001   0.449237       720            1001
2     1001   0.000000        60            1001
3     1001   0.000000        60            1001
4     1001   0.446766       120            1001
5     1001   1.831082        60            1001
6     1001   0.000000        60            1001
7     1001   0.000000       660  a_different_id
8     1001   0.000000        60  a_different_id
9     1001   0.000000      1200  a_different_id
10    1001   0.000000        60  a_different_id
11    1001   0.000000      3240  a_different_id
12    1001  10.060403       120            1001
13    1001  11.642113         0            1001
14    1001   0.000000        60            1001
15    1002   0.000000       120            1002
16    1002   1.234634      1380            1002
17    1002   1.831082        60            1002
18    1002   0.000000        60            1002

基于@Dataman建议的解决方案(非常感谢(,对我有效的代码是:

for Dev_ID, data in df.groupby("Dev_ID"): 
for k, g in groupby(data.iterrows(), lambda x: x[1]['Speed']): #group consecutive speeds
l = list(g)
if l[0][1]['Speed'] == 0: # check if the consective speeds are zeros
dur = sum(x[1]['Duration'] for x in l) # calculate how long speed 0 lasts
if dur>120:
zeros_idx.append([x[0] for x in l]) # save indexes where speed = 0 for long time
df.drop((item for sublist in zeros_idx for item in sublist),axis=0,inplace=True) #delete long stops

最新更新