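I'm currently looping over the GPS coordinates in a dataframe. For each coordinate, the loop scans a second dataframe of named locations with their own GPS coordinates and updates the original dataframe with the nearest location. This works fine, but it is very slow. Is there a faster way?
Here is some sample data:
Imports: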
from shapely.geometry import Point
import pandas as pd
from geopy import distance
Create sample df1:
gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points
Create sample df2:
locations = {'location':['San Diego', 'Austin', 'Washington DC'],
'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)
The two loops and the update:
lst = []  # list used to populate the new df column
for index, row in df_gps.iterrows():  # iterate over the first dataframe's rows
    point = row['points']  # pull out the GPS point
    closest_distance = float('inf')  # best distance found so far
    closest_location = None  # name of the closest location so far
    for index1, row1 in df_locations.iterrows():  # iterate over the second dataframe
        name = row1['location']  # name of the location
        point2 = row1['gps']  # coordinates of the location
        distances = distance.distance((point.x, point.y), (point2.x, point2.y)).miles  # distance in miles
        if distances < closest_distance:  # is this location closer?
            closest_distance = distances  # if so, keep its distance
            closest_location = name  # and its name
    lst.append(closest_location)  # append the closest city
df_gps['closest_city'] = lst  # add the new column with the closest cities
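I'd really like to do this in the fastest way possible. I've read about vectorization in pandas and considered writing a function and then using apply, as mentioned in "How to iterate over rows in a DataFrame in Pandas", but my code needs two loops and a conditional, so that pattern breaks down. Thanks for your help.
You can use KDTree from SciPy: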
from scipy.spatial import KDTree
# Extract lat/lon from your dataframes
points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
distances, indices = KDTree(cities).query(points)
df_gps['closest_city'] = df_locations.iloc[indices]['location'].values
df_gps['distance'] = distances  # Euclidean distance in degrees, not miles
You can use np.where to filter out distances that are too far.
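As a rough illustration of that filtering step, here is a minimal sketch. The arrays and the `max_dist` cutoff are made-up values for the example, not output from the query above. Keep in mind that the cutoff has to be in the same units as the distances, which for the KDTree snippet above means degrees.

```python
import numpy as np

# Hypothetical query results: nearest-city name and distance for each point.
closest_city = np.array(['San Diego', 'Austin', 'Washington DC'], dtype=object)
distances = np.array([7.3, 5.2, 12.0])

max_dist = 8.0  # hypothetical cutoff, in the same units as `distances`

# Keep the city name where the match is close enough, otherwise use None.
filtered = np.where(distances <= max_dist, closest_city, None)
print(filtered)
```

The resulting array can be assigned straight back to a column such as df_gps['closest_city'].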
For performance, check my answer, where df_gps has 25k rows and df_locations has 200k rows.
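Based on Corralien's insight, the final code for the answer:
import numpy as np
from sklearn.neighbors import BallTree
# The haversine metric expects (lat, lon) pairs in radians
points = df_gps['points'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)
# Passing the metric by name avoids importing DistanceMetric, whose location differs between scikit-learn versions
tree = BallTree(cities, metric='haversine')
dists, idx = tree.query(points)  # nearest city (k=1) for every point
df_gps['dist'] = dists.flatten() * 3956  # radians times Earth radius in miles
df_gps['closest_city'] = df_locations.iloc[idx.flatten()]['location'].values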
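To sanity-check the tree-based result on a small sample, one option is a brute-force haversine computed with NumPy broadcasting. This is only a sketch: `nearest_haversine` is a hypothetical helper written for this example, the coordinates are copied from the sample data in the question, and 3956 is the same Earth radius in miles as used above.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956  # same radius as in the BallTree code

def nearest_haversine(points_deg, cities_deg):
    """Return (index of nearest city, distance in miles) for each point.

    Both inputs are sequences of (lat, lon) pairs in degrees.
    """
    p = np.radians(np.asarray(points_deg, dtype=float))[:, None, :]  # (N, 1, 2)
    c = np.radians(np.asarray(cities_deg, dtype=float))[None, :, :]  # (1, M, 2)
    dlat = c[..., 0] - p[..., 0]
    dlon = c[..., 1] - p[..., 1]
    # Haversine formula, evaluated for every point/city pair at once
    a = np.sin(dlat / 2) ** 2 + np.cos(p[..., 0]) * np.cos(c[..., 0]) * np.sin(dlon / 2) ** 2
    dist = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))  # (N, M) matrix
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(idx)), idx]

points = [(37.773972, -122.431297), (35.4675602, -97.5164276), (42.35843, -71.05977)]
cities = [(32.715738, -117.161084), (30.267153, -97.7430608), (38.89511, -77.03637)]
names = ['San Diego', 'Austin', 'Washington DC']

idx, miles = nearest_haversine(points, cities)
nearest = [names[i] for i in idx]
print(nearest, miles.round(1))
```

This builds the full N x M distance matrix, so it is only practical for small inputs or for chunks of a larger one. At 25k by 200k rows the tree is the better fit.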