Pandas DataFrame: find the two postcodes that are furthest apart in each district



I have a DataFrame in Pandas that contains postcode data like the following:

postcode
AB1 0AA AB1 0AB AB1 0AD AB1 0AR AB1 0AS AB1 0AT

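I noticed that in your sample df the "district" and "coordinates" columns appear to be flipped. The code below uses your df as given.

The steps are: group by "coordinates" (which holds Aberdeenshire etc.), cross-merge each group with itself to create every possible pair of postcodes, apply a distance function to each pair, and keep the row with the maximum distance in each group.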

import haversine as hs  # the "haversine" package from PyPI

# Cross-merge each group with itself to build every postcode pair (needs pandas >= 1.2)
df1 = df.groupby('coordinates').apply(lambda g: g.merge(g, how='cross'))
df1['dist'] = df1.apply(lambda r: hs.haversine((r['lat_x'], r['long_x']), (r['lat_y'], r['long_y'])), axis=1)
df1.sort_values('dist', ascending=False).groupby('coordinates_x').head(1).reset_index(drop=True)

Output:

    postcode_x      lat_x    long_x  district_x             coordinates_x    postcode_y      lat_y    long_y  district_y             coordinates_y        dist
--  ------------  -------  --------  ---------------------  ---------------  ------------  -------  --------  ---------------------  ---------------  --------
0  AB1 0AR       57.0914  -2.22483  (57.091357 -2.224831)  Aberdeenshire    AB1 0AS       57.0838  -2.23444  (57.083838 -2.234437)  Aberdeenshire    1.01777
1  AB1 0AA       57.1015  -2.24285  (57.101474 -2.242851)  Aberdeen City    AB1 0AD       57.1006  -2.24834  (57.100556 -2.248342)  Aberdeen City    0.346992
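The same steps can also be sketched without the haversine package. This is only an illustration, not the answer's own code: it assumes the district names live in a column called 'district' (unlike the flipped sample df), rebuilds the six sample rows from the output above, replaces the per-group cross merge with a self-merge on 'district' (which produces the same within-district pairs), and uses a NumPy haversine with an assumed Earth radius of 6371 km.

```python
import numpy as np
import pandas as pd

# Sample rows copied from the output above. The column layout is an assumption:
# district names are stored under 'district' here.
df = pd.DataFrame({
    'postcode': ['AB1 0AA', 'AB1 0AB', 'AB1 0AD', 'AB1 0AR', 'AB1 0AS', 'AB1 0AT'],
    'lat': [57.101474, 57.102554, 57.100556, 57.091357, 57.083838, 57.089299],
    'long': [-2.242851, -2.246308, -2.248342, -2.224831, -2.234437, -2.239768],
    'district': ['Aberdeen City'] * 3 + ['Aberdeenshire'] * 3,
})

# Joining the frame to itself on 'district' yields every within-district pair.
pairs = df.merge(df, on='district')

# Vectorised haversine distance in km (Earth radius of 6371 km assumed).
lat1, lon1, lat2, lon2 = (
    np.radians(pairs[c]) for c in ['lat_x', 'long_x', 'lat_y', 'long_y']
)
a = (np.sin((lat2 - lat1) / 2) ** 2
     + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
pairs['dist'] = 2 * 6371 * np.arcsin(np.sqrt(a))

# Keep the single longest pair per district.
furthest = pairs.loc[pairs.groupby('district')['dist'].idxmax()].set_index('district')
print(furthest[['postcode_x', 'postcode_y', 'dist']])
```

Merging on the district key avoids groupby.apply altogether, so the pair table is built in one join; the trade-off is that it still materialises every within-district pair, which gets large for big districts.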

You can do it like this:

Create two copies of the DataFrame and cross-join them:

df1 = df.copy()
df2 = df.copy()
# Cross join through a constant helper key 'A', which is dropped afterwards
temp = df1.assign(A=1).merge(df2.assign(A=1), on='A').drop(columns='A')

Define the haversine function:

import numpy as np

def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    # Inputs are in degrees unless to_radians=False; the result is in km
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
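Because the function is built entirely from NumPy operations, it should also accept whole columns at once rather than one row at a time. The sketch below is a check of that idea, not part of the original answer: it restates the function (without the to_radians switch), verifies it against the known length of one degree of latitude, and compares a row-wise call with a column-wise call on two pair rows made up from coordinates in the question.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    # Inputs in degrees; result in km.
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# Sanity check: one degree of latitude spans earth_radius * pi / 180 km.
one_degree = haversine(0.0, 0.0, 1.0, 0.0)

# Two illustrative pair rows built from coordinates in the question.
pairs = pd.DataFrame({
    'lat_x': [57.101474, 57.091357], 'long_x': [-2.242851, -2.224831],
    'lat_y': [57.100556, 57.083838], 'long_y': [-2.248342, -2.234437],
})

# Row-wise call, one pair at a time.
row_wise = pairs.apply(
    lambda r: haversine(r['lat_x'], r['long_x'], r['lat_y'], r['long_y']), axis=1
)
# Column-wise call, all pairs in one NumPy pass.
vectorised = haversine(pairs['lat_x'], pairs['long_x'], pairs['lat_y'], pairs['long_y'])
print(one_degree, vectorised)
```

On a pair table with millions of rows the column-wise call avoids the Python-level loop that apply(axis=1) performs.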

Then compute all pairwise distances:

temp['distances'] = temp.apply(lambda x: haversine(x['lat_x'], x['long_x'], x['lat_y'], x['long_y']), axis=1)

This gives:


    postcode_x      lat_x    long_x              district_x  coordinates_x  
0     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
1     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
2     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
3     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
4     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
5     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
6     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
7     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
8     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
9     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
10    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
11    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
12    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
13    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
14    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
15    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
16    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
17    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
18    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
19    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
20    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
21    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
22    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
23    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
24    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
25    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
26    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
27    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
28    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
29    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
30    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
31    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
32    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
33    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
34    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
35    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
    postcode_y      lat_y    long_y              district_y  coordinates_y  
0     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
1     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
2     AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
3     AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
4     AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
5     AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
6     AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
7     AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
8     AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
9     AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
10    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
11    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
12    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
13    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
14    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
15    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
16    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
17    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
18    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
19    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
20    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
21    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
22    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
23    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
24    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
25    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
26    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
27    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
28    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
29    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
30    AB1 0AA  57.101474 -2.242851  (57.101474 ,-2.242851)  Aberdeen City   
31    AB1 0AB  57.102554 -2.246308  (57.102554 ,-2.246308)  Aberdeen City   
32    AB1 0AD  57.100556 -2.248342  (57.100556 ,-2.248342)  Aberdeen City   
33    AB1 0AR  57.091357 -2.224831  (57.091357, -2.224831)  Aberdeenshire   
34    AB1 0AS  57.083838 -2.234437  (57.083838 ,-2.234437)  Aberdeenshire   
35    AB1 0AT  57.089299 -2.239768  (57.089299 ,-2.239768)  Aberdeenshire   
    distances  
0    0.000000  
1    0.240859  
2    0.346992  
3    1.565351  
4    2.025836  
5    1.366547  
6    0.240859  
7    0.000000  
8    0.253869  
9    1.798078  
10   2.201213  
11   1.525913  
12   0.346992  
13   0.253869  
14   0.000000  
15   1.750198  
16   2.039937  
17   1.354641  
18   1.565351  
19   1.798078  
20   1.750198  
21   0.000000  
22   1.017773  
23   0.930967  
24   2.025836  
25   2.201213  
26   2.039937  
27   1.017773  
28   0.000000  
29   0.687374  
30   1.366547  
31   1.525913  
32   1.354641  
33   0.930967  
34   0.687374  
35   0.000000  
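The table above still mixes cross-district pairs, self-pairs (distance 0) and mirrored duplicates, so one more step is needed to get the furthest pair per district. A possible sketch, under the assumption that the district names sit in the 'coordinates_x' and 'coordinates_y' columns as they do in the output above, using a handful of rows copied from that output:

```python
import pandas as pd

# A few rows copied from the pairwise output above; as in that output,
# the district names are held in the 'coordinates_*' columns.
temp = pd.DataFrame(
    [
        ('AB1 0AA', 'Aberdeen City', 'AB1 0AA', 'Aberdeen City', 0.000000),
        ('AB1 0AA', 'Aberdeen City', 'AB1 0AB', 'Aberdeen City', 0.240859),
        ('AB1 0AA', 'Aberdeen City', 'AB1 0AD', 'Aberdeen City', 0.346992),
        ('AB1 0AA', 'Aberdeen City', 'AB1 0AS', 'Aberdeenshire', 2.025836),
        ('AB1 0AB', 'Aberdeen City', 'AB1 0AD', 'Aberdeen City', 0.253869),
        ('AB1 0AB', 'Aberdeen City', 'AB1 0AS', 'Aberdeenshire', 2.201213),
        ('AB1 0AD', 'Aberdeen City', 'AB1 0AA', 'Aberdeen City', 0.346992),
        ('AB1 0AR', 'Aberdeenshire', 'AB1 0AS', 'Aberdeenshire', 1.017773),
        ('AB1 0AR', 'Aberdeenshire', 'AB1 0AT', 'Aberdeenshire', 0.930967),
        ('AB1 0AS', 'Aberdeenshire', 'AB1 0AR', 'Aberdeenshire', 1.017773),
        ('AB1 0AS', 'Aberdeenshire', 'AB1 0AT', 'Aberdeenshire', 0.687374),
    ],
    columns=['postcode_x', 'coordinates_x', 'postcode_y', 'coordinates_y', 'distances'],
)

# Keep only pairs inside one district.
same_district = temp['coordinates_x'] == temp['coordinates_y']
# Comparing the postcode strings drops self-pairs and one of each mirrored pair.
one_direction = temp['postcode_x'] < temp['postcode_y']
within = temp[same_district & one_direction]

# Row with the largest distance in each district.
furthest = within.loc[within.groupby('coordinates_x')['distances'].idxmax()]
print(furthest[['coordinates_x', 'postcode_x', 'postcode_y', 'distances']])
```

Note that the largest distance overall (2.201213 km, AB1 0AB to AB1 0AS) is a cross-district pair and is correctly excluded by the first filter.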

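Updated my answer after finding what I believe is the dataset you are using (the full list of UK postcodes). Birmingham appears to be the district with the most postcodes, at 34,459. A 34459 x 34459 distance matrix sits right at the edge of what my 16GB RAM machine can handle without an out-of-memory error, but it does get there in the end.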
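For scale, 34,459 x 34,459 float64 values is about 1.19 billion entries, or roughly 9.5 GB for the matrix alone. If that does not fit in memory, one possible workaround (a rough sketch, not part of this answer's code) is to compute the matrix in row blocks and keep only the running maximum. The function name furthest_pair_chunked and its chunk parameter are hypothetical, and the haversine formula is written out in NumPy with an assumed Earth radius of 6371 km.

```python
import numpy as np

def furthest_pair_chunked(lat, lon, chunk=1000, earth_radius=6371.0):
    """Return (i, j, distance_km) for the furthest pair, using O(chunk * n) memory."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    best_d, best_i, best_j = -1.0, 0, 0
    for start in range(0, len(lat), chunk):
        stop = start + chunk
        # Distances from this block of rows to every point, via broadcasting.
        dlat = lat[start:stop, None] - lat[None, :]
        dlon = lon[start:stop, None] - lon[None, :]
        a = (np.sin(dlat / 2) ** 2
             + np.cos(lat[start:stop, None]) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2)
        block = 2 * earth_radius * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
        r, c = np.unravel_index(np.argmax(block), block.shape)
        # Keep the best pair seen so far across all blocks.
        if block[r, c] > best_d:
            best_d, best_i, best_j = float(block[r, c]), start + int(r), int(c)
    return best_i, best_j, best_d

# The six sample coordinates from the question, in the order AA, AB, AD, AR, AS, AT.
lat = [57.101474, 57.102554, 57.100556, 57.091357, 57.083838, 57.089299]
lon = [-2.242851, -2.246308, -2.248342, -2.224831, -2.234437, -2.239768]
i, j, d = furthest_pair_chunked(lat, lon, chunk=2)
print(i, j, d)
```

With chunk=1000 and 34,459 points, each block holds about 34 million values (around 276 MB) instead of the full matrix, at the cost of a Python-level loop over the blocks.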

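The vectorized scikit-learn implementation of the haversine distance formula seems to be much faster than the one I posted in my previous solution, so I have edited the dm = line in the function I posted earlier. I have also added a print statement that shows which district is currently being processed, so you can follow the progress. If it gets past Birmingham, it should run to completion.

The full code that gives me the desired result is as follows: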

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import haversine_distances

def get_furthest_in_district(df):
    # Progress indicator: district name and number of postcodes in it
    print(df['district'].iloc[0], len(df))
    # haversine_distances expects [lat, long] in radians and returns
    # distances on the unit sphere, so scale by the Earth radius in km
    dm = haversine_distances(np.deg2rad(df[['lat', 'long']])) * 6371
    # Row and column of the largest entry in the n x n distance matrix
    idx_1, idx_2 = np.unravel_index(np.argmax(dm), dm.shape)
    postcode_1 = df['postcode'].iloc[idx_1]
    postcode_2 = df['postcode'].iloc[idx_2]
    distance = dm[idx_1, idx_2]
    return pd.Series(
        data=[postcode_1, postcode_2, distance],
        index=['postcode_1', 'postcode_2', 'distance']
    )

results = df.groupby('district').apply(get_furthest_in_district)

This gives a final DataFrame of 374 rows (one per district). The first 5 rows are:

              postcode_1 postcode_2    distance
district                                       
Aberdeen City   AB23 8BS   AB31 3AS   20.252278
Aberdeenshire    AB3 5YB   AB43 8WA  133.891028
Adur            BN14 9JU    BN4 1PY    9.932842
Allerdale       CA12 4TP    CA7 5BP   50.893348
Amber Valley    DE55 7EG    DE6 5BG   23.678767
