我有以下df,我想将数据拆分为行程。
In: df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])
out: Dev_ID Speed Duration
0 1001 0.054013 0
1 1001 0.449237 12
2 1001 0.000000 1
3 1001 0.000000 1
4 1001 0.446766 2
5 1001 1.831082 1
6 1001 0.000000 1
7 1001 0.000000 11
8 1001 0.000000 1
9 1001 0.000000 20
10 1001 0.000000 1
11 1001 0.000000 54
12 1001 10.060403 2
13 1001 11.642113 0
14 1001 0.000000 1
15 1002 0.000000 2
16 1002 1.234634 23
17 1002 1.831082 1
18 1002 0.000000 1
拆分的标准是速度值0比120秒长。因此,我必须查找每个dev_ID,并以某种方式检查是否存在持续时间超过120秒的连续零。如果条件为真,我想删除这些行(其中零持续时间超过了120秒(,并在trip_ID列中启动一个新的ID。所以结果应该是这样的:
Dev_ID Speed Duration Trip_ID
0 1001 0.054013 0 10
1 1001 0.449237 12 10
2 1001 0.000000 1 10
3 1001 0.000000 1 10
4 1001 0.446766 2 10
5 1001 1.831082 1 10
6 1001 10.060403 2 11
7 1001 11.642113 0 11
8 1001 0.000000 1 11
9 1002 0.000000 2 12
10 1002 1.234634 23 12
11 1002 1.831082 1 12
12 1002 0.000000 1 12
我不完全确定我是否理解这个条件,但我制作了一个通用代码,希望它能相似,并且您可以进行调整。
关键思想是:使用pd.shift((获取差异,使用np.where获取速度差异序列为0的索引列表,使用get_contigous_index将这些索引拆分为连续组,然后对于每个连续组,如果持续时间总和>120,则更改"Trip_id">
我假设你的持续时间是分钟,否则没有一个间隔会超过120
import pandas as pd
import numpy as np
from itertools import groupby
from operator import itemgetter
df = pd.DataFrame([[1001,0.054012973,0],[1001,0.44923679,12],[1001,0,1],[1001,0,1],[1001,0.44676617,2],
[1001,1.8310822,1],[1001,0,1],[1001,0,11],[1001,0,1],[1001,0,20],[1001,0,1],[1001,0,54],[1001,10.0604029,2],
[1001,11.642113,0],[1001,0,1],[1002,0,2],[1002,1.23463449,23],[1002,1.8310822,1],[1002,0,1]],
columns=['Dev_ID','Speed','Duration'])
df['Duration'] = df['Duration']*60
df['Trip_ID'] = df['Dev_ID']
def get_contigous_index(indexes):
ranges = []
for k,g in groupby(enumerate(indexes),lambda x:x[0]-x[1]):
group = (map(itemgetter(1),g))
group = list(map(int,group))
ranges.append((group[0],group[-1]))
return ranges
for Dev_ID, data in df.groupby("Dev_ID"):
data['speed_diff'] = data['Speed'] - data['Speed'].shift(1)
diff_0 = np.where(data['speed_diff'] == 0)[0]
for contigousZeroes_range in get_contigous_index(diff_0):
fst_idx, lst_idx = list(contigousZeroes_range)
range_ = list(range(fst_idx,lst_idx+1))
subgroup = data.loc[range_ ,data.columns]
if not subgroup.empty:
if subgroup['Duration'].sum() > 120:
df.loc[range_,'Trip_ID'] = "a_different_id"
print(df)
这将打印这样的数据帧:
Dev_ID Speed Duration Trip_ID
0 1001 0.054013 0 1001
1 1001 0.449237 720 1001
2 1001 0.000000 60 1001
3 1001 0.000000 60 1001
4 1001 0.446766 120 1001
5 1001 1.831082 60 1001
6 1001 0.000000 60 1001
7 1001 0.000000 660 a_different_id
8 1001 0.000000 60 a_different_id
9 1001 0.000000 1200 a_different_id
10 1001 0.000000 60 a_different_id
11 1001 0.000000 3240 a_different_id
12 1001 10.060403 120 1001
13 1001 11.642113 0 1001
14 1001 0.000000 60 1001
15 1002 0.000000 120 1002
16 1002 1.234634 1380 1002
17 1002 1.831082 60 1002
18 1002 0.000000 60 1002
基于@Dataman建议的解决方案(非常感谢(,对我有效的代码是:
for Dev_ID, data in df.groupby("Dev_ID"):
for k, g in groupby(data.iterrows(), lambda x: x[1]['Speed']): #group consecutive speeds
l = list(g)
if l[0][1]['Speed'] == 0: # check if the consective speeds are zeros
dur = sum(x[1]['Duration'] for x in l) # calculate how long speed 0 lasts
if dur>120:
zeros_idx.append([x[0] for x in l]) # save indexes where speed = 0 for long time
df.drop((item for sublist in zeros_idx for item in sublist),axis=0,inplace=True) #delete long stops