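I want to calculate the distance covered since the variable event was last equal to 1. Importantly, the distance should be calculated separately for each ID.
My dataset consists of a date, car_id, latitude, longitude, and a dummy variable indicating whether or not an event occurred. The formula I use to calculate the distance is: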
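# wgs84_geod is assumed to be a pyproj Geod object, e.g. Geod(ellps='WGS84')
def Distance(Latitude, Longitude, LatitudeDecimal, LongitudeDecimal):
    # inv() takes longitude before latitude and returns both azimuths plus the distance
    az12, az21, dist = wgs84_geod.inv(Longitude, Latitude, LongitudeDecimal, LatitudeDecimal)
    return dist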
What I want is the distance between two geographic points for each car_id since the last event == 1, giving the column distance_since_event:
date car_id latitude longitude event distance_since_event
01/01/2019 1 43.5863 7.12993 0 -1
01/01/2019 2 44.3929 8.93832 0 -1
02/01/2019 1 43.5393 7.03134 1 -1
02/01/2019 2 39.459462 -0.312280 0 -1
03/01/2019 1 44.3173 84.942 0 calculation=(distance from 02/01/2019-03/01/2019 for ID=1)
03/01/2019 2 -12.3284 -9.04522 1 -1
04/01/2019 1 -36.8414 17.4762 0 calculation=(distance from 02/01/2019-04/01/2019 for ID=1)
04/01/2019 2 43.542 10.2958 0 calculation=(distance from 03/01/2019-04/01/2019 for ID=2)
05/01/2019 1 43.5242 69.473 0 calculation=(distance from 02/01/2019-05/01/2019 for ID=1)
05/01/2019 2 37.9382 23.668 1 calculation=(distance from 03/01/2019-05/01/2019 for ID=2)
06/01/2019 1 4.4409 89.218 1 calculation=(distance from 02/01/2019-06/01/2019 for ID=1)
06/02/2019 2 25.078037 -77.328900 0 calculation=(distance from 05/01/2019-06/01/2019 for ID=2)
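The key function here is pandas.merge_asof with allow_exact_matches=False, which pairs each row with the most recent strictly earlier event row for the same car_id: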
import pandas as pd

input = pd.DataFrame([
    ["01/01/2019", 1, 43.5863 , 7.12993, 0],
    ["01/01/2019", 2, 44.3929 , 8.93832, 0],
    ["02/01/2019", 1, 43.5393 , 7.03134, 1],
    ["02/01/2019", 2, 39.459462, -0.31228, 0],
    ["03/01/2019", 1, 44.3173 , 84.942, 0],
    ["03/01/2019", 2, -12.3284 ,-9.04522, 1],
    ["04/01/2019", 1, -36.8414 ,17.4762, 0],
    ["04/01/2019", 2, 43.542 , 10.2958, 0],
    ["05/01/2019", 1, 43.5242 , 69.473, 0],
    ["05/01/2019", 2, 37.9382 , 23.668, 1],
    ["06/01/2019", 1, 4.4409 , 89.218, 1],
    ["06/02/2019", 2, 25.078037, -77.3289, 0]],
    columns=["date","car_id","latitude", "longitude" , "event"])
# pd.to_datetime reads these strings month-first by default; pass dayfirst=True for dd/mm/yyyy data
input['date'] = pd.to_datetime(input['date'])
# Left side: every row. Right side: event rows only. merge_asof needs both sorted by date.
df = pd.merge_asof(input.set_index('date'), input.loc[input['event'] == 1].set_index('date'),
                   on='date', suffixes=['_l','_r'], by='car_id', allow_exact_matches=False)
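To see what allow_exact_matches=False changes, here is a minimal sketch on a tiny made-up frame. The names toy, events and event_date are illustrative only and do not appear in the question. With the default setting an event row matches itself, which would give it a distance of 0; with the flag turned off, every row only looks strictly backwards in time.

```python
import pandas as pd

# Toy data for a single car; the values are made up for illustration.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03"]),
    "car_id": [1, 1, 1],
    "event": [0, 1, 0],
})

# Right-hand side: event rows only, with a copy of the date so we can see which event matched.
events = toy.loc[toy["event"] == 1, ["date", "car_id"]].assign(
    event_date=lambda d: d["date"]
)

# Default behaviour: an exact match on the key is allowed, so the event row matches itself.
with_exact = pd.merge_asof(toy, events, on="date", by="car_id")

# Strictly earlier events only, which is what "since the last event" needs.
strict = pd.merge_asof(toy, events, on="date", by="car_id", allow_exact_matches=False)

print(with_exact[["date", "event_date"]])
print(strict[["date", "event_date"]])
```

In the strict result, the event row itself has no match (NaT) and only the following row picks up the event date.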
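At this point, every row in df contains the elements needed for the remaining calculation. Since I am not sure whether your Distance() function accepts DataFrames, we can use .apply() to append the distance_since_event column.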
def getDistance(lat1, lat2, long1, long2):
    # No earlier event for this car_id: the right-hand columns are NaN
    if pd.isna(lat2) or pd.isna(long2):
        return -1
    # substitute this with the actual wgs84_geod library that you eventually use
    return ((lat2-lat1)**2 + (long2-long1)**2) **0.5

df['distance_since_event'] = df.apply(lambda row: getDistance(row['latitude_l'], row['latitude_r'], row['longitude_l'], row['longitude_r']), axis=1)
print(df)
Output:
car_id date latitude_l longitude_l event_l latitude_r longitude_r event_r distance_since_event
0 1 2019-01-01 43.586300 7.12993 0 NaN NaN NaN -1.000000
1 2 2019-01-01 44.392900 8.93832 0 NaN NaN NaN -1.000000
2 1 2019-02-01 43.539300 7.03134 1 NaN NaN NaN -1.000000
3 2 2019-02-01 39.459462 -0.31228 0 NaN NaN NaN -1.000000
4 1 2019-03-01 44.317300 84.94200 0 43.5393 7.03134 1.0 77.914544
5 2 2019-03-01 -12.328400 -9.04522 1 NaN NaN NaN -1.000000
6 1 2019-04-01 -36.841400 17.47620 0 43.5393 7.03134 1.0 81.056474
7 2 2019-04-01 43.542000 10.29580 0 -12.3284 -9.04522 1.0 59.123402
8 1 2019-05-01 43.524200 69.47300 0 43.5393 7.03134 1.0 62.441662
9 2 2019-05-01 37.938200 23.66800 1 -12.3284 -9.04522 1.0 59.974043
10 1 2019-06-01 4.440900 89.21800 1 43.5393 7.03134 1.0 91.012812
11 2 2019-06-02 25.078037 -77.32890 0 37.9382 23.66800 1.0 101.812365
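Note that the distances in the output above are plain Euclidean differences in degrees, because getDistance is only a placeholder. As a rough sketch of a drop-in replacement that needs nothing beyond the standard library, the following uses the haversine formula on a sphere. This is an assumption made for illustration: it is not the WGS84 ellipsoidal calculation that wgs84_geod.inv performs, and the names haversine_m and get_distance, as well as the mean Earth radius of 6371 km, are hypothetical choices rather than anything from the question.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6371000.0):
    """Great-circle distance in metres on a sphere (approximation, not WGS84)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot near antipodal points.
    return 2 * radius_m * math.asin(math.sqrt(min(1.0, a)))

def get_distance(lat1, lat2, long1, long2):
    # Same argument order as getDistance above; -1 marks "no earlier event".
    if math.isnan(lat2) or math.isnan(long2):
        return -1
    return haversine_m(lat1, long1, lat2, long2)
```

It can be passed to df.apply in the same way as getDistance. If ellipsoidal accuracy matters, calling your own Distance() inside the function is the more faithful option.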
From here you can rename or drop columns as needed.
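A minimal sketch of that cleanup, assuming the column names produced by the '_l' and '_r' suffixes above. The merged frame here is a two-row stand-in built from values in the output, and tidy is a hypothetical name.

```python
import pandas as pd

# Stand-in for two rows of the merged frame; values copied from the output above.
merged = pd.DataFrame({
    "car_id": [1, 1],
    "date": pd.to_datetime(["2019-02-01", "2019-03-01"]),
    "latitude_l": [43.5393, 44.3173],
    "longitude_l": [7.03134, 84.942],
    "event_l": [1, 0],
    "latitude_r": [float("nan"), 43.5393],
    "longitude_r": [float("nan"), 7.03134],
    "event_r": [float("nan"), 1.0],
    "distance_since_event": [-1.0, 77.914544],
})

# Drop the helper columns that came from the event side of the join.
tidy = merged.drop(columns=["latitude_r", "longitude_r", "event_r"])

# Strip the "_l" suffix so the names match the original dataset.
tidy = tidy.rename(columns={
    "latitude_l": "latitude",
    "longitude_l": "longitude",
    "event_l": "event",
})

# Restore the column order used in the question.
tidy = tidy[["date", "car_id", "latitude", "longitude", "event", "distance_since_event"]]
print(tidy)
```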