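I have read a lot about how to merge rows in a pandas DataFrame, but I am struggling to work out how to apply it to my case. I have a DataFrame containing vehicle trip data, so each vehicle can make several trips on a given day. Here is an example: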
vehicleID | start pos time | end pos time | duration (seconds) | meters travelled
---|---|---|---|---
XXXXX | 2021-10-26 06:01:12+00:00 | 2021-10-26 06:25:06+00:00 | 1434 | 2000
XXXXX | 2021-10-19 13:49:09+00:00 | 2021-10-19 13:59:29+00:00 | 620 | 5000
XXXXX | 2021-10-19 13:20:36+00:00 | 2021-10-19 13:26:40+00:00 | 364 | 70000
YYYYY | 2022-09-10 15:14:07+00:00 | 2022-09-10 15:29:39+00:00 | 932 | 8000
YYYYY | 2022-08-28 15:16:35+00:00 | 2022-08-28 15:28:43+00:00 | 728 | 90000
I believe I have found a solution.
Setup
import pandas as pd
from datetime import timedelta
data = {'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
4: 'YYYYY'},
'start pos time': {0: '2021-10-26 06:01:12+00:00',
1: '2021-10-19 13:49:09+00:00',
2: '2021-10-19 13:20:36+00:00',
3: '2022-09-10 15:14:07+00:00',
4: '2022-08-28 15:16:35+00:00'},
'end pos time': {0: '2021-10-26 06:25:06+00:00',
1: '2021-10-19 13:59:29+00:00',
2: '2021-10-19 13:26:40+00:00',
3: '2022-09-10 15:29:39+00:00',
4: '2022-08-28 15:28:43+00:00'},
'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
}
df = pd.DataFrame(data)
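Assumptions:
- All groups (unique values) in column vehicleID are ordered consecutively.
- For each group in column vehicleID, the associated timestamps in column start pos time are sorted in descending order.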
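If the data does not already satisfy these assumptions, it can be sorted into that shape first. The following is a minimal sketch on a small, hypothetical unsorted frame (the name `trips` and the row order are made up for illustration):

```python
import pandas as pd

# Hypothetical unsorted input, used only for illustration
trips = pd.DataFrame({
    "vehicleID": ["YYYYY", "XXXXX", "YYYYY", "XXXXX"],
    "start pos time": pd.to_datetime([
        "2022-08-28 15:16:35+00:00",
        "2021-10-19 13:20:36+00:00",
        "2022-09-10 15:14:07+00:00",
        "2021-10-26 06:01:12+00:00",
    ]),
})

# Keep each vehicleID together and put the newest trip first within each ID
trips = trips.sort_values(
    by=["vehicleID", "start pos time"], ascending=[True, False]
).reset_index(drop=True)
print(trips)
```

Resetting the index is optional; it only keeps the row labels in line with the new order.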
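Problem
Within each group in column vehicleID, if the start pos time of a trip is earlier than the end pos time of the previous trip (which, given the descending sort, is in the next row), or less than 30 minutes later, these rows should become one row, with min for start pos time, max for end pos time, and sum for duration and meters travelled.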
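As a quick worked example of this rule, the sketch below checks the two XXXXX trips from 2021-10-19 in the sample data. The variable names are made up for illustration.

```python
import pandas as pd

# End of the earlier trip and start of the later trip (values from the sample data)
earlier_end = pd.Timestamp("2021-10-19 13:26:40+00:00")
later_start = pd.Timestamp("2021-10-19 13:49:09+00:00")

# The trips belong together when the gap between them is under 30 minutes
gap = later_start - earlier_end
should_merge = gap < pd.Timedelta(minutes=30)
print(gap, should_merge)  # 0 days 00:22:29 True
```

The gap is 22 minutes 29 seconds, so these two rows should be merged, while the trip on 2021-10-26 is days apart and stays on its own.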
Solution
# if still needed, change date time strings into timestamps
df[['start pos time','end pos time']] = (
    df[['start pos time','end pos time']].apply(pd.to_datetime)
)
# check (end time + timedelta 29m+59s) < (start time shifted), i.e. whether the
# later trip in the row above starts 30 minutes or more after this trip ends
cond1 = (
    (df.loc[:,'end pos time']+timedelta(minutes=29, seconds=59))
    .lt(df.loc[:,'start pos time'].shift(1))
)
# check `vehicleID` != its own shift (this means a new group is starting)
# i.e. a new group should always get `True`
cond2 = (df.loc[:,'vehicleID'] != df.loc[:,'vehicleID'].shift(1))
# cumsum result of OR check conds
cond = (cond1 | cond2).cumsum()
# apply groupby on ['vehicleID' & cond] and aggregate appropriate functions
# (adding vehicleID is now unnecessary, but this keeps the col in the data)
res = df.groupby(['vehicleID', cond], as_index=False).agg(
{'start pos time':'min',
'end pos time':'max',
'duration (seconds)':'sum',
'meters travelled':'sum'}
)
print(res)
  vehicleID            start pos time              end pos time  duration (seconds)  meters travelled
0     XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00                1434              2000
1     XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00                 984             75000
2     YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00                 932              8000
3     YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00                 728             90000
I have run one check: the solution should also work when more than two consecutive trips stay within the defined range.
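A check of that kind could look like the sketch below. The vehicle `AAAAA` and its four trips are hypothetical: three trips are chained with 15-minute gaps and one lies hours earlier. The group number is stored in an explicit `group` column here (an assumption made for clarity, not part of the code above) so that the grouping key is visible in the result.

```python
import pandas as pd

# Hypothetical trips for one vehicle, newest first (matching the assumptions)
trips = pd.DataFrame({
    "vehicleID": ["AAAAA"] * 4,
    "start pos time": pd.to_datetime([
        "2021-10-19 14:10:00+00:00",
        "2021-10-19 13:45:00+00:00",
        "2021-10-19 13:20:00+00:00",
        "2021-10-19 09:00:00+00:00",
    ]),
    "end pos time": pd.to_datetime([
        "2021-10-19 14:20:00+00:00",
        "2021-10-19 13:55:00+00:00",
        "2021-10-19 13:30:00+00:00",
        "2021-10-19 09:10:00+00:00",
    ]),
    "meters travelled": [300, 200, 100, 50],
})

# True where the later trip (row above) starts 30 minutes or more after this trip ends
gap_too_long = (trips["end pos time"] + pd.Timedelta(minutes=29, seconds=59)).lt(
    trips["start pos time"].shift(1)
)
# True on the first row of every vehicle
new_vehicle = trips["vehicleID"] != trips["vehicleID"].shift(1)
# Each True starts a new group; the running sum labels the groups
trips["group"] = (gap_too_long | new_vehicle).cumsum()

merged = trips.groupby(["vehicleID", "group"], as_index=False).agg(
    {"start pos time": "min", "end pos time": "max", "meters travelled": "sum"}
)
print(merged)
```

The three chained trips end up in a single row (13:20:00 to 14:20:00, 600 meters), and the early-morning trip keeps its own row.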
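Update: In the answer by @BeRT2me, the duration (seconds) values of the original rows that are merged into a new row are not summed; the duration is instead recalculated from the new start and end times. That makes sense. To do the same with my approach, adjust the last part of the code as follows: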
# cut out `duration` here:
res = df.groupby(['vehicleID', cond], as_index=False).agg(
{'start pos time':'min',
'end pos time':'max',
# 'duration (seconds)':'sum',
'meters travelled':'sum'}
)
# and recalculate the duration
res['duration (seconds)'] = (
    res['end pos time'].sub(res['start pos time']).dt.total_seconds()
)
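The two ways of computing the duration give different numbers, because the recalculated span also includes the idle time between trips. A small sketch using the two XXXXX trips from 2021-10-19 in the sample data (the names `pair`, `summed` and `span` are made up for illustration):

```python
import pandas as pd

# The two XXXXX trips from 2021-10-19 that end up in one merged row
pair = pd.DataFrame({
    "start pos time": pd.to_datetime(
        ["2021-10-19 13:49:09+00:00", "2021-10-19 13:20:36+00:00"]
    ),
    "end pos time": pd.to_datetime(
        ["2021-10-19 13:59:29+00:00", "2021-10-19 13:26:40+00:00"]
    ),
    "duration (seconds)": [620, 364],
})

# Option 1: add up the driving time of the individual trips
summed = pair["duration (seconds)"].sum()
# Option 2: time from the first start to the last end, idle gap included
span = (pair["end pos time"].max() - pair["start pos time"].min()).total_seconds()
print(summed, span)  # 984 2333.0
```

The difference of 1349 seconds is exactly the 22 minute 29 second gap between the two trips, so which option is right depends on whether idle time should count.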
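There is probably a more efficient way to code this, but something like the following should work (new_df has what you want):
Note: the code below assumes the start and end times are in datetime format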
df = pd.DataFrame({'vehicleID': {0: 'XXXXX', 1: 'XXXXX', 2: 'XXXXX', 3: 'YYYYY',
4: 'YYYYY'},
'start pos time': {0: '2021-10-26 06:01:12+00:00',
1: '2021-10-19 13:49:09+00:00',
2: '2021-10-19 13:20:36+00:00',
3: '2022-09-10 15:14:07+00:00',
4: '2022-08-28 15:16:35+00:00'},
'end pos time': {0: '2021-10-26 06:25:06+00:00',
1: '2021-10-19 13:59:29+00:00',
2: '2021-10-19 13:26:40+00:00',
3: '2022-09-10 15:29:39+00:00',
4: '2022-08-28 15:28:43+00:00'},
'duration (seconds)': {0: 1434, 1: 620, 2: 364, 3: 932, 4: 728},
'meters travelled': {0: 2000, 1: 5000, 2: 70000, 3: 8000, 4: 90000}
})
# if the times are still strings, convert them to timestamps first (skip this if they already are)
df[['start pos time', 'end pos time']] = df[['start pos time', 'end pos time']].apply(pd.to_datetime)
# sort dataframe by ID and then start time of trip
df = df.sort_values(by=['vehicleID', 'start pos time'])
# create a new column with the end time of the previous ride
df['prev end'] = df['end pos time'].shift(1)
# create a new column with the difference between the start time of the current trip and the end time of the prior one
df['diff'] = df['start pos time'] - df['prev end']
# convert difference column to seconds
df['diff'] = df['diff'].dt.total_seconds()
# where vehicle IDs are the same and the difference between the start time of the current trip and end time of the
# prior trip is less than or equal to 30 minutes, change the start time of the current trip to the start time of the
# prior one
df.loc[((df['vehicleID'] == df['vehicleID'].shift(1)) & (df['diff'] <= 30*60)), 'start pos time'] = df['start pos time'].shift(1)
# create a new dataframe, grouped by vehicle ID and trip start time, using the maximum end time for each group
new_df = df.groupby(['vehicleID', 'start pos time'], as_index=False).agg({'end pos time':'max',
                                                                           'duration (seconds)':'sum',
                                                                           'meters travelled':'sum'})
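Edit: if there may be more than two trips that need to be aggregated (as @ouroboros1 pointed out), you can replace the code after "convert difference column to seconds" with: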
# [based on @ouroboros1 solution] where vehicle IDs are the same and the difference between the start time of the current
# trip and end time of the prior trip is less than or equal to 30 minutes, put trips in the same "group"
df.loc[:, 'group'] = ((df['vehicleID'] != df['vehicleID'].shift(1)) | (df['diff'] > 30*60)).cumsum()
# create a new dataframe, grouped by vehicle ID and group, using the minimum start time and maximum end time for each group
new_df = df.groupby(['vehicleID', 'group'], as_index=False).agg({'start pos time':'min',
'end pos time':'max',
'duration (seconds)':'sum',
'meters travelled':'sum'})
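To illustrate why the group column matters for chained trips, here is a minimal sketch with three hypothetical trips of one vehicle, each starting 15 minutes after the previous one ends. It compares overwriting the start time with the running-group label; the names `trips`, `close`, `n_overwrite` and `n_cumsum` are made up for illustration.

```python
import pandas as pd

# Three hypothetical chained trips, sorted oldest first
trips = pd.DataFrame({
    "vehicleID": ["AAAAA"] * 3,
    "start pos time": pd.to_datetime([
        "2021-10-19 13:20:00+00:00",
        "2021-10-19 13:45:00+00:00",
        "2021-10-19 14:10:00+00:00",
    ]),
    "end pos time": pd.to_datetime([
        "2021-10-19 13:30:00+00:00",
        "2021-10-19 13:55:00+00:00",
        "2021-10-19 14:20:00+00:00",
    ]),
})

same_id = trips["vehicleID"] == trips["vehicleID"].shift(1)
gap = (trips["start pos time"] - trips["end pos time"].shift(1)).dt.total_seconds()
close = same_id & (gap <= 30 * 60)

# Variant 1: copy the previous row's start time; the third trip only
# inherits the second trip's original start, so the chain is split
overwritten = trips.copy()
overwritten.loc[close, "start pos time"] = trips["start pos time"].shift(1)
n_overwrite = overwritten["start pos time"].nunique()

# Variant 2: start a new group only where a trip is not close to the previous one
group = (~close).cumsum()
n_cumsum = group.nunique()

print(n_overwrite, n_cumsum)  # 2 1
```

Overwriting the start time leaves two grouping keys for what should be one merged trip, while the cumulative sum puts all three rows into the same group.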
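# assumes the column names use underscores (e.g. start_pos_time) and the time columns hold timestamps
def func(d):
    # flag trips that start less than 30 minutes after the end of the trip in the next row
    # (the earlier trip, when rows are sorted newest first)
    mask = d.start_pos_time.sub(d.end_pos_time.shift(-1)).lt(pd.Timedelta(minutes=30))
    # give flagged trips the start time of that earlier trip, so both rows share one key
    d.loc[mask, 'start_pos_time'] = d.start_pos_time.shift(-1)
    d = d.groupby('start_pos_time', as_index=False).agg({'end_pos_time': 'max', 'meters_travelled': 'sum'})
    return d

df = df.groupby('vehicleID').apply(func).reset_index('vehicleID').reset_index(drop=True)
# recalculate the duration from the merged start and end times
df['duration_(seconds)'] = (df.end_pos_time - df.start_pos_time).dt.total_seconds()
print(df)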
Output:
vehicleID start_pos_time end_pos_time meters_travelled duration_(seconds)
0 XXXXX 2021-10-19 13:20:36+00:00 2021-10-19 13:59:29+00:00 75000 2333.0
1 XXXXX 2021-10-26 06:01:12+00:00 2021-10-26 06:25:06+00:00 2000 1434.0
2 YYYYY 2022-08-28 15:16:35+00:00 2022-08-28 15:28:43+00:00 90000 728.0
3 YYYYY 2022-09-10 15:14:07+00:00 2022-09-10 15:29:39+00:00 8000 932.0
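The column names in this output use underscores instead of spaces, which is also what attribute access such as `d.start_pos_time` requires. The renaming step is not shown in the answer; one possible way to do it, offered here as an assumption, is to replace the spaces in the column index:

```python
import pandas as pd

# Empty frame with the original column names, just to show the renaming
df = pd.DataFrame(columns=[
    "vehicleID", "start pos time", "end pos time",
    "duration (seconds)", "meters travelled",
])

# Replace every space with an underscore so columns can be used as attributes
df.columns = df.columns.str.replace(" ", "_")
print(list(df.columns))
```

After this step the frame has the columns `start_pos_time`, `end_pos_time`, `duration_(seconds)` and `meters_travelled`, matching the header printed above.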