SciKits球树方法给了我不正确的"nearest neighbor"



我使用下面给出的源代码来获得最接近的"站点";。

来源:https://automating-gis-processes.github.io/site/notebooks/L3/nearest-neighbor-faster.html

我的代码:

# Read data from a DB
test_df = pd.read_sql_query(sql, conn)
# Calculates distance between 2 points on a map using lat and long 
# (Source: https://towardsdatascience.com/heres-how-to-calculate-distance-between-2-geolocations-in-python-93ecab5bbba4)
def haversine_distance(lat1, lon1, lat2, lon2):
r = 6371
phi1 = np.radians(float(lat1))
phi2 = np.radians(float(lat2))
delta_phi = np.radians(lat2 - lat1)
delta_lambda = np.radians(lon2- lon1)
a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) *   np.sin(delta_lambda / 2)**2
res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
return np.round(res, 2)
test_df["actualDistance (km)"] = test_df.apply(lambda row: haversine_distance(row['ClientLat'],row['ClientLong'],row['actual_SLa'],row['actual_SLo']), axis=1)
test_gdf = geopandas.GeoDataFrame(test_df, geometry=geopandas.points_from_xy(test_df.ClientLong, test_df.ClientLat))
site_gdf = geopandas.GeoDataFrame(site_df, geometry=geopandas.points_from_xy(site_df.SiteLong, site_df.SiteLat))
#-------Set up the functions as shown in the tutorial-------
def get_nearest(src_points, candidates, k_neighbors=1):
"""Find nearest neighbors for all source points from a set of candidate points"""
# Create tree from the candidate points
tree = BallTree(candidates, leaf_size=15, metric='haversine')
# Find closest points and distances
distances, indices = tree.query(src_points, k=k_neighbors)
# Transpose to get distances and indices into arrays
distances = distances.transpose()
indices = indices.transpose()
# Get closest indices and distances (i.e. array at index 0)
# note: for the second closest points, you would take index 1, etc.
closest = indices[0]
closest_dist = distances[0]
# Return indices and distances
return (closest, closest_dist)

def nearest_neighbor(left_gdf, right_gdf, return_dist=False):
"""
For each point in left_gdf, find closest point in right GeoDataFrame and return them.
NOTICE: Assumes that the input Points are in WGS84 projection (lat/lon).
"""
left_geom_col = left_gdf.geometry.name
right_geom_col = right_gdf.geometry.name
# Ensure that index in right gdf is formed of sequential numbers
right = right_gdf.copy().reset_index(drop=True)
# Parse coordinates from points and insert them into a numpy array as RADIANS
left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.x * np.pi / 180, geom.y * np.pi / 180)).to_list())
# Find the nearest points
# -----------------------
# closest ==> index in right_gdf that corresponds to the closest point
# dist ==> distance between the nearest neighbors (in meters)
closest, dist = get_nearest(src_points=left_radians, candidates=right_radians)
# Return points from right GeoDataFrame that are closest to points in left GeoDataFrame
closest_points = right.loc[closest]
# Ensure that the index corresponds the one in left_gdf
closest_points = closest_points.reset_index(drop=True)
# Add distance if requested
if return_dist:
# Convert to meters from radians
earth_radius = 6371000  # meters
closest_points['distance'] = dist * earth_radius
return closest_points
closest_sites = nearest_neighbor(test_gdf, site_gdf, return_dist=True)
# Rename the geometry of closest sites gdf so that we can easily identify it
closest_sites = closest_sites.rename(columns={'geometry': 'closest_site_geom'})
# Merge the datasets by index (for this, it is good to use '.join()' -function)
test_gdf = test_gdf.join(closest_sites)
#Extracted closest site latitude and longitude for data analysis
test_gdf['CS_lo'] = test_gdf.closest_site_geom.apply(lambda p: p.x)
test_gdf['CS_la'] = test_gdf.closest_site_geom.apply(lambda p: p.y)

该代码是我提供的教程链接的副本。根据他们的解释,这本应该奏效。

为了验证这些数据,我使用.describe()获得了一些统计数据,它向我表明,教程方法确实给了我一个比实际数据中的距离近得多的平均距离(792米与1.80公里的实际距离(。使用BallTree方法生成的最近距离数据中的实际距离

然而,当我使用plotly在地图上绘制它们时,我注意到BallTree方法的输出并不比";实际的";距离这通常是绘制数据的样子(蓝色:预先确定的站点,红色:使用BallTree方法预测的站点有人能帮我找出差异吗

问题是教程代码提供的坐标是Longitude, Latitude格式,而不是BallTree预期的Latitude, Longitude格式。所以你在测量倒置点之间的距离。

如果您在坐标解析代码中交换geom.x和geom.y的顺序,您将获得正确的测量值。

# Parse coordinates from points and insert them into a numpy array as RADIANS
left_radians = np.array(left_gdf[left_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())
right_radians = np.array(right[right_geom_col].apply(lambda geom: (geom.y * np.pi / 180, geom.x * np.pi / 180)).to_list())

我不确定为什么这样做,但确实如此。我决定只根据文档编写代码,而不是遵循教程,这很有效:

# Build BallTree with haversine distance metric, which expects (lat, lon) in radians and returns distances in radians
dist = DistanceMetric.get_metric('haversine')
tree = BallTree(np.radians(site_df[['SiteLat', 'SiteLong']]), metric=dist)
test_coords = np.radians(test_df[['ClientLat', 'ClientLong']])
dists, ilocs = tree.query(test_coords)

最新更新