I'm using PySpark on Databricks, where I have a table of data points that looks like the following:
pingsGeo.show(5)
+--------------------+--------------------+----------+--------------------+
| ID| point| date| distance|
+--------------------+--------------------+----------+--------------------+
|00007436cf7f96cb1...|POINT (-82.640937...|2020-03-19|0.022844737780675896|
|00007436cf7f96cb1...|POINT (-82.641281...|2020-03-19|3.946137920280456...|
|00007436cf7f96cb1...|POINT (-82.650238...|2020-03-19| 0.00951798692682881|
|00007436cf7f96cb1...|POINT (-82.650947...|2020-03-19|7.503617154519347E-4|
|00007436cf7f96cb1...|POINT (-82.655853...|2020-03-19|0.007148426134394903|
+--------------------+--------------------+----------+--------------------+
root
|-- ID: string (nullable = true)
|-- point: geometry (nullable = false)
|-- date: date (nullable = true)
|-- distance: double (nullable = false)
and another table of polygons (from a shapefile):
zoneShapes.show(5)
+--------+--------------------+
|COUNTYNS| geometry|
+--------+--------------------+
|01026336|POLYGON ((-78.901...|
|01025844|POLYGON ((-80.497...|
|01074088|POLYGON ((-81.686...|
|01213687|POLYGON ((-76.813...|
|01384015|POLYGON ((-95.152...|
I want to assign a COUNTYNS to each point. I'm doing it with GeoSpark functions, as follows:
queryOverlap = """
SELECT p.ID, z.COUNTYNS as zone, p.date, p.point, p.distance
FROM pingsGeo as p, zoneShapes as z
WHERE ST_Intersects(p.point, z.geometry)
"""
spark.sql(queryOverlap).show(5)
This query works on smaller datasets, but fails on larger ones:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 117 in stage 51.0 failed 4 times, most recent failure: Lost task 117.3 in stage 51.0 (TID 4879, 10.17.21.12, executor 13): org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
I'd like to know whether there's a way to optimize the process.
Your question is a bit vague, but here's what I would look at first.
A few things to consider:

1. The physical resources available to your Spark cluster.
2. The partitioning of your tables — if they are not partitioned properly, a larger-than-necessary data shuffle may occur.
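To illustrate the partitioning point: GeoSpark exposes spatial-partitioning knobs through the Spark conf. A minimal sketch, assuming a GeoSpark 1.x-style setup (the setting names are from the GeoSpark tuning docs; verify them against your installed version, and treat the partition count as a placeholder to tune):

```python
# Hedged sketch: configure GeoSpark's spatial partitioning before running
# the join. Setting names assume GeoSpark 1.x; check your version's docs.
spark.conf.set("geospark.join.gridtype", "kdbtree")   # spatial grid used to partition the join
spark.conf.set("geospark.join.numpartition", "200")   # partition count; 200 is an assumption, tune to your cluster
```

With these set before `spark.sql(queryOverlap)`, the spatial join should distribute the points across partitions bounded in space rather than relying on the default shuffle.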
Also, consider building an index on the largest table.
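For the indexing point, GeoSpark can build a local spatial index on each partition via the Spark conf as well. Again a hedged sketch (setting names per the GeoSpark 1.x tuning docs; confirm for your version):

```python
# Hedged sketch: enable a per-partition spatial index so ST_Intersects
# does an index lookup instead of a full scan of each partition.
spark.conf.set("geospark.global.index", "true")           # build an index on the partitioned data
spark.conf.set("geospark.global.indextype", "quadtree")   # "quadtree" or "rtree"
```

Between spatial partitioning and an index, the memory pressure per task usually drops substantially, which is what the `SparkOutOfMemoryError` suggests you need.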