Adding a column to one Pandas DataFrame from another, currently using two loops with a conditional: is there a faster way?



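I am currently looping through the GPS coordinates in a DataFrame. For each coordinate, the loop searches a second DataFrame that holds the GPS coordinates of specific locations, and updates the original DataFrame with the nearest location. This works fine, but it is very slow. Is there a faster way?

Here is the sample data:

Imports: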

from shapely.geometry import Point
import pandas as pd
from geopy import distance

Create sample df1

gps_points = [Point(37.773972,-122.431297) , Point(35.4675602,-97.5164276) , Point(42.35843, -71.05977)]
df_gps = pd.DataFrame()
df_gps['points'] = gps_points

Create sample df2

locations = {'location':['San Diego', 'Austin', 'Washington DC'],
'gps':[Point(32.715738 , -117.161084), Point(30.267153 , -97.7430608), Point(38.89511 , -77.03637)]}
df_locations = pd.DataFrame(locations)

The two loops and the update:

lst = []  # create empty list to populate new df column
for index, row in df_gps.iterrows():  # iterate over first dataframe rows
    point = row['points']  # pull out GPS point
    closest_distance = 999999  # create container for distance
    closest_location = None  # create container for closest location
    for index1, row1 in df_locations.iterrows():  # iterate over second dataframe
        name = row1['location']  # assign name of location
        point2 = row1['gps']  # assign coordinates of location
        distances = distance.distance((point.x, point.y), (point2.x, point2.y)).miles  # calculate distance
        if distances < closest_distance:  # check to see if distance is closer
            closest_distance = distances  # if distance is closer assign it
            closest_location = name  # if distance is closer assign name
    lst.append(closest_location)  # append closest city
df_gps['closest_city'] = lst  # add new column with closest cities

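I would really like to do this in the fastest way possible. I have read about vectorization in pandas and considered writing a function and then using apply, as mentioned in "How to iterate over rows in a DataFrame in Pandas", but my code needs two loops and a conditional, so that pattern breaks down. Thanks for your help.

You can use KDTree from Scipy: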

from scipy.spatial import KDTree
# Extract lat/lon from your dataframes
points = df_gps['points'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: (p.x, p.y)).apply(pd.Series)
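# KDTree measures Euclidean distance on the raw lat/lon values,
# so the distances returned here are in degrees, not miles
distances, indices = KDTree(cities).query(points)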
df_gps['closest_city'] = df_locations.iloc[indices]['location'].values
df_gps['distance'] = distances
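
One caveat with a KDTree built on raw latitude/longitude: straight-line distance in degrees only approximates distance on the globe, and the approximation gets worse at high latitudes. A common workaround, sketched below under my own assumptions (the helper name to_unit_xyz and the 3956-mile Earth radius are illustrative choices, not part of the answer above), is to map each coordinate to a 3D unit vector first. The straight-line (chord) distance between unit vectors grows with the great-circle distance, so the nearest neighbour is the same, and the chord can be converted back to miles.

```python
import numpy as np
from scipy.spatial import KDTree

EARTH_RADIUS_MILES = 3956  # approximate value, assumed for illustration

def to_unit_xyz(latlon_deg):
    """Convert (lat, lon) pairs in degrees to 3D unit vectors (hypothetical helper)."""
    lat, lon = np.radians(np.asarray(latlon_deg, dtype=float)).T
    return np.column_stack((np.cos(lat) * np.cos(lon),
                            np.cos(lat) * np.sin(lon),
                            np.sin(lat)))

# Same coordinates as the sample data in the question, as plain (lat, lon) tuples
gps = [(37.773972, -122.431297), (35.4675602, -97.5164276), (42.35843, -71.05977)]
city_names = ['San Diego', 'Austin', 'Washington DC']
city_coords = [(32.715738, -117.161084), (30.267153, -97.7430608), (38.89511, -77.03637)]

# Query the tree in 3D; chord is the straight-line distance between unit vectors
chord, idx = KDTree(to_unit_xyz(city_coords)).query(to_unit_xyz(gps))

# Convert chord length to arc length: angle = 2 * arcsin(chord / 2)
miles = 2 * EARTH_RADIUS_MILES * np.arcsin(np.clip(chord / 2, 0, 1))
closest = [city_names[i] for i in idx]
```

With the real dataframes, the coordinate lists would come from the Point columns, for example [(p.x, p.y) for p in df_gps['points']].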

You can use np.where to filter out matches whose distance is too large.
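
A minimal sketch of that filtering step, assuming the result columns are named closest_city and distance as in the snippet above. The threshold max_distance and the distance values are made up for illustration; the threshold must be in the same units as the distance column.

```python
import numpy as np
import pandas as pd

# Stand-in for df_gps after the nearest-neighbour lookup (illustrative values only)
df_demo = pd.DataFrame({
    'closest_city': ['San Diego', 'Austin', 'Washington DC'],
    'distance': [7.3, 5.2, 12.0],
})

max_distance = 10.0  # hypothetical cutoff, same units as the 'distance' column

# Keep the city name where the match is close enough, otherwise mark it as missing
df_demo['closest_city'] = np.where(df_demo['distance'] <= max_distance,
                                   df_demo['closest_city'],
                                   None)
```

Rows beyond the cutoff end up with a missing value in closest_city, which can then be dropped with dropna or filled with a label of your choice.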

For performance, check my answer where df_gps has 25k rows and df_locations has 200k rows.

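Based on Corralien's insight, here is the final code:

import numpy as np
from sklearn.neighbors import BallTree
# The haversine metric expects (lat, lon) pairs in radians
points = df_gps['points'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)
cities = df_locations['gps'].apply(lambda p: np.radians((p.x, p.y))).apply(pd.Series)
# Passing the metric by name avoids importing DistanceMetric,
# whose import path differs between scikit-learn versions
tree = BallTree(cities, metric='haversine')
dists, idx = tree.query(points)  # distances in radians, row positions in df_locations
df_gps['dist'] = dists.flatten() * 3956  # multiply by Earth's radius in miles (approx.)
df_gps['closest_city'] = df_locations.iloc[idx.flatten()]['location'].values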
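
To sanity-check the tree-based result on a small sample, a brute-force comparison can help. The sketch below is my own addition, not part of the answer above: it computes the haversine distance for every point/city pair with NumPy broadcasting and takes the minimum per row. The function name nearest_haversine is hypothetical. Memory grows with len(points) * len(cities), so this is only meant for spot checks, not for the full 25k x 200k case.

```python
import numpy as np

EARTH_RADIUS_MILES = 3956  # same approximate radius as the code above

def nearest_haversine(points_deg, cities_deg):
    """Return (index of nearest city, distance in miles) for each point.

    Both inputs are sequences of (lat, lon) pairs in degrees.
    """
    p = np.radians(np.asarray(points_deg, dtype=float))[:, None, :]  # shape (n, 1, 2)
    c = np.radians(np.asarray(cities_deg, dtype=float))[None, :, :]  # shape (1, m, 2)
    dlat = c[..., 0] - p[..., 0]
    dlon = c[..., 1] - p[..., 1]
    # Haversine formula, evaluated for all n x m pairs at once
    a = np.sin(dlat / 2) ** 2 + np.cos(p[..., 0]) * np.cos(c[..., 0]) * np.sin(dlon / 2) ** 2
    miles = 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))
    idx = miles.argmin(axis=1)  # nearest city per point
    return idx, miles[np.arange(len(idx)), idx]

# Sample coordinates from the question, as plain (lat, lon) tuples
sample_points = [(37.773972, -122.431297), (35.4675602, -97.5164276), (42.35843, -71.05977)]
sample_cities = [(32.715738, -117.161084), (30.267153, -97.7430608), (38.89511, -77.03637)]

idx, miles = nearest_haversine(sample_points, sample_cities)
```

For the sample data, idx should match the row positions that tree.query returns, and miles should agree with df_gps['dist'] up to floating-point error.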