Best approach for geospatial indexing in Palantir Foundry



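What is the recommended approach for building a pipeline in Palantir Foundry that needs to find the points contained in polygons (shapes)? In the past, this has been fairly difficult in Spark. GeoSpark has been quite popular, but it can still lag. If Foundry has nothing specific for this, I can implement something with GeoSpark. I have about 13k shapes and batches of thousands of points.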

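How big is the dataset? With a large enough driver and some optimizations, I have gotten this to work with geopandas before. Just make sure the coordinate points use the same projection as the polygons.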
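As a minimal sketch of that projection check, assuming a reasonably recent geopandas and that you know the CRS of each input: the toy data, the column names `point_id` and `polygon_id`, and the two EPSG codes below are made up for illustration. Note that a GeoDataFrame built from raw GeoJSON without a `crs` argument has no CRS set, so this comparison only helps once the CRS has been declared.

```python
import geopandas
from shapely.geometry import Point, Polygon

# Toy inputs (illustrative only): points in lon/lat (EPSG:4326),
# polygons in Web Mercator metres (EPSG:3857).
points = geopandas.GeoDataFrame(
    {"point_id": [1]},
    geometry=[Point(10.0, 0.0)],
    crs="EPSG:4326",
)
polygons = geopandas.GeoDataFrame(
    {"polygon_id": ["a"]},
    geometry=[Polygon([
        (1_000_000, -100_000),
        (1_200_000, -100_000),
        (1_200_000, 100_000),
        (1_000_000, 100_000),
    ])],
    crs="EPSG:3857",
)

# Reproject the points so both sides share one CRS before any spatial join.
if points.crs != polygons.crs:
    points = points.to_crs(polygons.crs)
```

Without the reprojection, the point at (10, 0) would be compared directly against coordinates in the millions, and the join would silently return no match.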

Here is a helper function:

from shapely import geometry
import json
import geopandas
from pyspark.sql import functions as F

def geopandas_spatial_join(df_left, df_right, geometry_left, geometry_right, how='inner', op='intersects'):
    '''
    Computes a spatial join of two Geopandas dataframes. Implements the Geopandas "sjoin" method, reference: https://geopandas.org/reference/geopandas.sjoin.html.
    Expects both dataframes to contain a GeoJSON geometry column, whose names are passed as the 'geometry_left' and 'geometry_right' arguments.
    Inputs:
        df_left (PANDAS_DATAFRAME): Left input dataframe.
        df_right (PANDAS_DATAFRAME): Right input dataframe.
        geometry_left (string): Name of the geometry column of the left dataframe.
        geometry_right (string): Name of the geometry column of the right dataframe.
        how (string): The type of join, one of {'left', 'right', 'inner'}.
        op (string): Binary predicate, one of {'intersects', 'contains', 'within'}.
    Outputs:
        (PANDAS_DATAFRAME): Joined dataframe.
    '''
    # Parse the GeoJSON strings into shapely geometries.
    # Note: the helper columns are added to the input dataframes in place.
    df1 = df_left
    df1["geometry_left_shape"] = df1[geometry_left].apply(json.loads)
    df1["geometry_left_shape"] = df1["geometry_left_shape"].apply(geometry.shape)
    gdf_left = geopandas.GeoDataFrame(df1, geometry="geometry_left_shape")
    df2 = df_right
    df2["geometry_right_shape"] = df2[geometry_right].apply(json.loads)
    df2["geometry_right_shape"] = df2["geometry_right_shape"].apply(geometry.shape)
    gdf_right = geopandas.GeoDataFrame(df2, geometry="geometry_right_shape")
    # Note: geopandas 0.10+ renamed the `op` keyword to `predicate`; pass predicate=op there.
    joined = geopandas.sjoin(gdf_left, gdf_right, how=how, op=op)
    # Drop the temporary shapely columns so only the original columns are returned.
    joined = joined.drop(joined.filter(items=["geometry_left_shape", "geometry_right_shape"]).columns, axis=1)
    return joined

Then we can run the join:

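# Runs inside the transform function, where points_df and polygon_df are the
# Spark input dataframes and spark is the active SparkSession.
import pandas as pd
# toPandas() collects each dataset onto the driver, so both must fit in driver memory.
left_df = points_df.toPandas()
left_geo_column = "point_geometry"
right_df = polygon_df.toPandas()
right_geo_column = "polygon_geometry"
pdf = geopandas_spatial_join(left_df, right_df, left_geo_column, right_geo_column)
return_df = spark.createDataFrame(pdf).dropDuplicates()
return return_df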
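To spot-check a few joined rows without relying on geopandas, a plain ray-casting test over the GeoJSON strings can serve as a rough sanity check. This is only a sketch: the function names `point_in_ring` and `point_in_polygon` are hypothetical, it assumes planar coordinates in a shared projection, it handles only the GeoJSON `Point` and `Polygon` types, and points lying exactly on a boundary are not treated consistently.

```python
import json

def point_in_ring(x, y, ring):
    """Ray casting: count how many ring edges a ray going right from (x, y) crosses."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (yi > y) != (yj > y):
            x_cross = xi + (y - yi) * (xj - xi) / (yj - yi)
            if x < x_cross:
                inside = not inside
        j = i
    return inside

def point_in_polygon(point_geojson, polygon_geojson):
    """True if the point is inside the exterior ring and outside every hole."""
    x, y = json.loads(point_geojson)["coordinates"][:2]
    rings = json.loads(polygon_geojson)["coordinates"]
    exterior, holes = rings[0], rings[1:]
    if not point_in_ring(x, y, exterior):
        return False
    return not any(point_in_ring(x, y, hole) for hole in holes)

# Toy polygon (illustrative): a 4x4 square with a 1x1 hole.
square_with_hole = json.dumps({
    "type": "Polygon",
    "coordinates": [
        [[0, 0], [4, 0], [4, 4], [0, 4], [0, 0]],
        [[1, 1], [2, 1], [2, 2], [1, 2], [1, 1]],
    ],
})
in_square = point_in_polygon(json.dumps({"type": "Point", "coordinates": [3, 3]}), square_with_hole)
in_hole = point_in_polygon(json.dumps({"type": "Point", "coordinates": [1.5, 1.5]}), square_with_hole)
outside = point_in_polygon(json.dumps({"type": "Point", "coordinates": [5, 5]}), square_with_hole)
```

This brute-force check has no spatial index, so it is meant for verifying a sample of results, not for replacing the join across 13k shapes.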
