I have a DataFrame in pandas that contains postcode data like the following:
postcode
AB1 0AA
AB1 0AB
AB1 0AD
AB1 0AR
AB1 0AS
AB1 0AT
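I noticed that in your example df the "district" and "coordinates" columns appear to be flipped. The code below uses your df as given.
The steps are: group by "coordinates" (which holds Aberdeenshire etc.), cross-merge each group with itself to create every possible pair of postcodes, apply the distance function, and take the row with the maximum distance in each group.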
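import haversine as hs  # third-party package (pip install haversine)
# Cross-merge each group with itself to get every pair of postcodes (needs pandas >= 1.2)
df1 = df.groupby('coordinates').apply(lambda g: g.merge(g, how='cross'))
df1['dist'] = df1.apply(lambda r: hs.haversine((r['lat_x'], r['long_x']), (r['lat_y'], r['long_y'])), axis=1)
# Keep the most distant pair in each group
df1.sort_values('dist', ascending=False).groupby('coordinates_x').head(1).reset_index(drop=True)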
Output:
postcode_x lat_x long_x district_x coordinates_x postcode_y lat_y long_y district_y coordinates_y dist
-- ------------ ------- -------- --------------------- --------------- ------------ ------- -------- --------------------- --------------- --------
0 AB1 0AR 57.0914 -2.22483 (57.091357 -2.224831) Aberdeenshire AB1 0AS 57.0838 -2.23444 (57.083838 -2.234437) Aberdeenshire 1.01777
1 AB1 0AA 57.1015 -2.24285 (57.101474 -2.242851) Aberdeen City AB1 0AD 57.1006 -2.24834 (57.100556 -2.248342) Aberdeen City 0.346992
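The same steps (group, build all pairs, measure, take the maximum) can be cross-checked without pandas. The following is only a minimal sketch under stated assumptions, not the method used in the answer above: `haversine_km`, `rows`, `by_district` and `furthest` are hypothetical names, the Earth radius of 6371 km is an assumed mean value, and the district name is attached to each postcode directly. It uses `itertools.combinations`, so each unordered pair is visited once and self-pairs never appear.

```python
import math
from collections import defaultdict
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius in km


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# (postcode, lat, long, district) rows copied from the sample data
rows = [
    ("AB1 0AA", 57.101474, -2.242851, "Aberdeen City"),
    ("AB1 0AB", 57.102554, -2.246308, "Aberdeen City"),
    ("AB1 0AD", 57.100556, -2.248342, "Aberdeen City"),
    ("AB1 0AR", 57.091357, -2.224831, "Aberdeenshire"),
    ("AB1 0AS", 57.083838, -2.234437, "Aberdeenshire"),
    ("AB1 0AT", 57.089299, -2.239768, "Aberdeenshire"),
]

# Step 1: group the points by district
by_district = defaultdict(list)
for postcode, lat, lon, district in rows:
    by_district[district].append((postcode, lat, lon))

# Steps 2-4: every unordered pair per district, its distance, then the maximum
furthest = {}
for district, points in by_district.items():
    furthest[district] = max(
        ((p[0], q[0], haversine_km(p[1], p[2], q[1], q[2]))
         for p, q in combinations(points, 2)),
        key=lambda pair: pair[2],
    )

for district, (pc1, pc2, dist) in furthest.items():
    print(district, pc1, pc2, round(dist, 6))
```

A district with only one postcode has no pairs, so `max()` would raise `ValueError` there; such groups would need to be skipped or given a default.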
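You can do it like this.
Create two copies of the DataFrame and cross-join them on a constant key, so that every postcode is paired with every postcode:
df1 = df.copy()
df2 = df.copy()
temp = df1.assign(A=1).merge(df2.assign(A=1), on='A').drop(columns='A')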
Define the haversine function:
import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    # Great-circle distance; the result is in km because earth_radius is in km
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Then compute all pairwise distances:
temp['distances'] = temp.apply(lambda x: haversine(x['lat_x'], x['long_x'], x['lat_y'], x['long_y']), axis=1)
This gives:
postcode_x lat_x long_x district_x coordinates_x
0 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
1 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
2 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
3 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
4 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
5 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
6 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
7 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
8 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
9 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
10 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
11 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
12 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
13 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
14 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
15 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
16 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
17 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
18 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
19 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
20 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
21 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
22 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
23 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
24 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
25 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
26 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
27 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
28 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
29 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
30 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
31 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
32 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
33 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
34 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
35 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
postcode_y lat_y long_y district_y coordinates_y
0 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
1 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
2 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
3 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
4 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
5 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
6 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
7 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
8 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
9 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
10 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
11 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
12 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
13 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
14 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
15 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
16 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
17 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
18 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
19 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
20 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
21 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
22 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
23 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
24 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
25 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
26 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
27 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
28 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
29 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
30 AB1 0AA 57.101474 -2.242851 (57.101474 ,-2.242851) Aberdeen City
31 AB1 0AB 57.102554 -2.246308 (57.102554 ,-2.246308) Aberdeen City
32 AB1 0AD 57.100556 -2.248342 (57.100556 ,-2.248342) Aberdeen City
33 AB1 0AR 57.091357 -2.224831 (57.091357, -2.224831) Aberdeenshire
34 AB1 0AS 57.083838 -2.234437 (57.083838 ,-2.234437) Aberdeenshire
35 AB1 0AT 57.089299 -2.239768 (57.089299 ,-2.239768) Aberdeenshire
distances
0 0.000000
1 0.240859
2 0.346992
3 1.565351
4 2.025836
5 1.366547
6 0.240859
7 0.000000
8 0.253869
9 1.798078
10 2.201213
11 1.525913
12 0.346992
13 0.253869
14 0.000000
15 1.750198
16 2.039937
17 1.354641
18 1.565351
19 1.798078
20 1.750198
21 0.000000
22 1.017773
23 0.930967
24 2.025836
25 2.201213
26 2.039937
27 1.017773
28 0.000000
29 0.687374
30 1.366547
31 1.525913
32 1.354641
33 0.930967
34 0.687374
35 0.000000
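Because the haversine formula above is built entirely from NumPy functions, it can probably be applied to whole columns at once instead of row by row with `apply`. The sketch below illustrates that idea and then picks the most distant pair per district. It is a hedged example rather than part of the original answer: the sample DataFrame is rebuilt by hand, `district` is assumed to hold the district name (not the flipped layout from the question), and `same` and `furthest` are hypothetical names.

```python
import numpy as np
import pandas as pd


def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Vectorised great-circle distance in km; inputs are degrees (scalars or arrays)."""
    lat1, lon1, lat2, lon2 = (
        np.radians(np.asarray(v, dtype=float)) for v in (lat1, lon1, lat2, lon2)
    )
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))


# Sample data copied from the output above; column layout is an assumption
df = pd.DataFrame({
    "postcode": ["AB1 0AA", "AB1 0AB", "AB1 0AD", "AB1 0AR", "AB1 0AS", "AB1 0AT"],
    "lat": [57.101474, 57.102554, 57.100556, 57.091357, 57.083838, 57.089299],
    "long": [-2.242851, -2.246308, -2.248342, -2.224831, -2.234437, -2.239768],
    "district": ["Aberdeen City"] * 3 + ["Aberdeenshire"] * 3,
})

# All 6 x 6 = 36 ordered pairs (how="cross" needs pandas >= 1.2)
temp = df.merge(df, how="cross")

# One vectorised call over whole columns instead of a row-wise apply
temp["distances"] = haversine(
    temp["lat_x"], temp["long_x"], temp["lat_y"], temp["long_y"]
)

# Keep only pairs within the same district, then take the largest distance per district
same = temp[temp["district_x"] == temp["district_y"]]
furthest = same.loc[
    same.groupby("district_x")["distances"].idxmax(),
    ["district_x", "postcode_x", "postcode_y", "distances"],
].reset_index(drop=True)
print(furthest)
```

The cross join still materialises n² rows, so this mainly removes the per-row Python overhead; it does not reduce memory use.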
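I have updated my answer after finding the dataset I believe you are using (the complete list of UK postcodes). Birmingham appears to be the district with the most postcodes, at 34,459. A 34459 x 34459 distance matrix is right at the edge of what my 16 GB RAM machine can handle without an out-of-memory error, but it got there in the end.
The vectorised scikit-learn implementation of the haversine distance formula appears to be much faster than the one I posted in my previous solution, so I edited the dm = line in the function I posted earlier. I also added a print statement that shows which district is currently being processed, so you can get a sense of progress. If you can get past Birmingham, it should finish.
The full code that gave me the desired solution is as follows: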
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

def get_furthest_in_district(df):
    # Progress indicator: district name and number of postcodes in it
    print(df['district'].iloc[0], len(df))
    # Pairwise distance matrix; haversine_distances expects radians
    # and returns unit-sphere distances, so scale by the Earth radius in km
    dm = haversine_distances(np.deg2rad(df[['lat', 'long']])) * 6371
    # Row and column of the largest entry in the matrix
    idx_1, idx_2 = np.unravel_index(np.argmax(dm), dm.shape)
    postcode_1 = df['postcode'].iloc[idx_1]
    postcode_2 = df['postcode'].iloc[idx_2]
    distance = dm[idx_1, idx_2]
    return pd.Series(
        data=[postcode_1, postcode_2, distance],
        index=['postcode_1', 'postcode_2', 'distance']
    )

results = df.groupby('district').apply(get_furthest_in_district)
This gives a final DataFrame of 374 rows (one per district); the first 5 rows are as follows:
postcode_1 postcode_2 distance
district
Aberdeen City AB23 8BS AB31 3AS 20.252278
Aberdeenshire AB3 5YB AB43 8WA 133.891028
Adur BN14 9JU BN4 1PY 9.932842
Allerdale CA12 4TP CA7 5BP 50.893348
Amber Valley DE55 7EG DE6 5BG 23.678767
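On the memory limit mentioned above: a 34459 x 34459 matrix of float64 values takes 34459² x 8 bytes, which is roughly 9.5 GB before any temporaries. One possible way around that is to compute the matrix in row blocks and keep only the running maximum. The sketch below is an assumption-laden illustration, not the code that produced the results: `furthest_pair_chunked` and `chunk_size` are hypothetical names, and it swaps in a hand-written NumPy broadcast of the haversine formula instead of scikit-learn's `haversine_distances` so that it runs with NumPy alone.

```python
import numpy as np


def furthest_pair_chunked(lat_deg, lon_deg, chunk_size=2048, earth_radius=6371.0):
    """Return (i, j, distance_km) for the most distant pair of points.

    Only blocks of shape (chunk_size, n) are held in memory at any time,
    never the full n x n distance matrix.
    """
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    best = (0, 0, 0.0)
    for start in range(0, len(lat), chunk_size):
        stop = start + chunk_size
        # Broadcast one block of rows against every point: shape (block, n)
        dlat = lat[start:stop, None] - lat[None, :]
        dlon = lon[start:stop, None] - lon[None, :]
        a = (np.sin(dlat / 2.0) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :])
             * np.sin(dlon / 2.0) ** 2)
        # Clip guards against rounding pushing a slightly outside [0, 1]
        block = 2 * earth_radius * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        i, j = np.unravel_index(np.argmax(block), block.shape)
        if block[i, j] > best[2]:
            best = (start + int(i), int(j), float(block[i, j]))
    return best


# Usage on the six sample postcodes, ignoring districts; tiny chunks to exercise the loop
lat = [57.101474, 57.102554, 57.100556, 57.091357, 57.083838, 57.089299]
lon = [-2.242851, -2.246308, -2.248342, -2.224831, -2.234437, -2.239768]
i, j, dist = furthest_pair_chunked(lat, lon, chunk_size=2)
print(i, j, round(dist, 6))
```

Inside a per-district function the returned positions could be mapped back to postcodes with `.iloc`, in the same way as `idx_1` and `idx_2` above. Peak memory then scales with `chunk_size` times the district size, at the cost of a Python-level loop over blocks.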