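>Context: I am looking for an efficient way, in PySpark, to compute the distances between a single lat/long pair and an array of lat/long pairs, and then take the minimum of those distances.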

How this would work:

Step 1: I have a Spark dataframe containing restaurant IDs, with latitude and longitude as columns

# Something like this
>>> restaurants_df
restaurant id | lat   | long 
123           | 32.34 | 54.62

Step 2: I have a pandas dataframe of gas stations

>>> gas_stations_df
gas_station id | lat   | long 
456            | 76.22 | 64.24
789            | 24.65 | 35.55

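Step 3: I now want to compute the haversine distance between each restaurant and ALL of the gas station locations, and then take the minimum distance. So let's say:
- Haversine distance between restaurant ID 123 and gas station 456 = 5m
- Haversine distance between restaurant ID 123 and gas station 789 = 12m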

Then I want to return 5m as the value, since it is the smallest distance. I want to do this for every restaurant ID. Some pseudocode to make the problem clearer:

# Pseudocode to illustrate the desired logic
for each_restaurant in a list of restaurants:
    calculate the distance between the restaurant and ALL the gas stations
    return minimum distance

Progress so far

So far I have tried both a vectorized pandas UDF and a plain UDF, as shown below

def haversine_distance(lat, long):
    """Get haversine distances from a single (lat, long) pair to an array
    of (lat, long) pairs.
    """
    # Convert the lat long to radians
    lat = lat.apply(lambda x: radians(x))
    long = long.apply(lambda x: radians(x))
    unit = 'm'
    single_loc = pd.DataFrame([lat, long]).T
    single_loc.columns = ['Latitude', 'Longitude']
    other_locs = gas_stations_df[['Latitude', 'Longitude']].values  # this is a pandas dataframe
    dist_l = []
    for index, row in single_loc.iterrows():
        # .... do haversine distance calculations
        # d = haversine distance

        dist_l.append(np.min(d))
    return pd.Series(dist_l)
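
The haversine step elided in the function above could be computed for a whole batch at once instead of row by row. The following is only a minimal sketch under stated assumptions, not the code from the question: the name `min_haversine_m` and the constant `EARTH_RADIUS_M` (a mean Earth radius of 6,371 km) are hypothetical, and all inputs are assumed to be in degrees. It uses NumPy broadcasting to build a restaurants-by-stations distance matrix and takes the row-wise minimum.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # assumption: mean Earth radius in metres


def min_haversine_m(lat, lon, other_lat, other_lon):
    """Minimum haversine distance in metres from each (lat, lon) to any (other_lat, other_lon).

    All inputs are in degrees. lat/lon have shape (n,), other_lat/other_lon have shape (m,).
    """
    # Shape (n, 1) against (1, m) broadcasts to an (n, m) matrix.
    lat1 = np.radians(np.asarray(lat, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(other_lat, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(other_lon, dtype=float))[None, :]

    # Haversine formula: a = sin^2(dlat/2) + cos(lat1) * cos(lat2) * sin^2(dlon/2)
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    # Clipping guards against rounding that pushes a slightly outside [0, 1].
    dist = 2.0 * EARTH_RADIUS_M * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return dist.min(axis=1)  # nearest station per restaurant


# Sample coordinates taken from the two tables in the question.
nearest = min_haversine_m([32.34], [54.62], [76.22, 24.65], [64.24, 35.55])
```

Because `np.asarray` accepts pandas Series, a function like this could serve as the body of a pandas UDF, with the gas station arrays extracted once from `gas_stations_df`. The intermediate matrix has one entry per restaurant and station pair, so memory grows with the batch size times the number of stations.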

I then apply my pandas UDF `haversine_distance` as follows:

restaurant_df = restaurant_df.withColumn('distance_to_nearest_gas_station', lit(haversine_distance('latitude', 'longitude')))

While this approach works, it still scales quite slowly, and I was wondering whether there is a simpler way to do this?

Thank you very much for reading!

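I would ignore the "haversine" requirement at first and use a k-d tree (2- or 3-dimensional) to filter the candidates down to a few points; that should be very fast. If you want or need the exact distance to the resulting point, you can then compute it with whatever formula you like.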
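
One way to read this suggestion, sketched here under assumptions rather than as the answerer's actual code: map each lat/long to a point on the unit sphere, build a 3-dimensional k-d tree over the gas stations with SciPy's `cKDTree`, and query it for the nearest neighbour of each restaurant. The helper names `to_unit_xyz` and `nearest_station_m` and the Earth radius constant are hypothetical. The straight-line (chord) distance returned by the tree increases with the great-circle distance, so the nearest point by chord is also the nearest on the sphere, and the chord converts back to an arc length directly.

```python
import numpy as np
from scipy.spatial import cKDTree

EARTH_RADIUS_M = 6_371_000.0  # assumption: mean Earth radius in metres


def to_unit_xyz(lat_deg, lon_deg):
    """Convert latitude/longitude in degrees to points on the unit sphere, shape (n, 3)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))


def nearest_station_m(rest_lat, rest_lon, station_lat, station_lon):
    """Return (distance in metres, station row index) of the nearest station per restaurant."""
    tree = cKDTree(to_unit_xyz(station_lat, station_lon))
    # query returns the Euclidean (chord) distance and the index of the nearest tree point.
    chord, idx = tree.query(to_unit_xyz(rest_lat, rest_lon), k=1)
    # A chord of length c on the unit sphere spans an arc of 2 * arcsin(c / 2) radians.
    arc = 2.0 * np.arcsin(np.clip(chord / 2.0, 0.0, 1.0))
    return EARTH_RADIUS_M * arc, idx


# Sample coordinates taken from the two tables in the question.
dist_m, station_idx = nearest_station_m([32.34], [54.62], [76.22, 24.65], [64.24, 35.55])
```

The tree is built once over the stations and each query is roughly logarithmic in the number of stations, which is where the speed-up over comparing every pair would come from. `station_idx` holds row positions in the station arrays, so the matching gas station ID can be looked up afterwards.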
