如何在 dataFrameToAdd.lable != dataFrameMain.label 和距离 res 小于 0.0002 的情况下将 dataFrameToAdd 添加到 dataFrameMain ?
case class Schema(name: String,label: String, lat: Double, lon: Double)
val dataFrameMain = sc.parallelize(Array(
Schema("recordA","house",54.78049,-1.57679 ),
Schema("recordB","hotel",52.02724,-2.16572 ),
Schema("recordC","hotel",52.51423,-1.97814 ),
Schema("recordD","house",51.46966,-0.45227 ),
Schema("recordE","house",50.91608,-1.45803 ),
Schema("recordF","house",52.59754,-1.07599 )
)).toDF()
val dataFrameToAdd = sc.parallelize(Array(
Schema("recordAduplicate","house", 54.780705, -1.576777),
Schema("recordBnotDuplicate","hotel",54.783477, -1.57986 )
)).toDF()
def distance(latDF: Double, lonDF: Double,latNEW: Double, lonNEW: Double): Double = {
val dx = latNEW - latDF
val dy = lonNEW - lonDF
val res = math.sqrt(dx*dx + dy*dy)
return res }
import org.apache.spark.sql.functions.udf
sqlContext.udf.register("distance",distance(_:Double,_:Double, _:Double, _:Double ): Double)
我不确定如何处理这个问题。我应该应用转置函数还是包括 Mlib 矩阵数据结构?作为此示例的输出,来自 dataFrameToAdd 的 recordBnotDuplicate 应与 dataFrameMain 合并,因为它的距离大于 0.0002,但不是 recordAduplicat,因为它与来自 dataFrameMain 的 recordA 具有相同的表,并且距离值小于 0.0002
这里有一种方法 - 不确定它在性能方面是否是最有效的方法:
注册 UDF 后,立即将每个 DF 注册为 tmp 表,并使用 LEFT JOIN 选择与 A 的任何记录都不匹配的所有 B 记录;然后 - 将结果与 A 联合:
dataFrameMain.registerTempTable("a")
dataFrameToAdd.registerTempTable("b")
val withoutDuplicates: DataFrame = sqlContext.sql(
"""
|SELECT b.*
|FROM b
|LEFT JOIN a ON a.label = b.label AND distance(a.lat, a.lon, b.lat, b.lon) <= 0.002
|WHERE a.name IS NULL
""".stripMargin)
val result = withoutDuplicates.unionAll(dataFrameMain)
打印结果会给出预期的结果:
+-------------------+-----+---------+--------+
| name|label| lat| lon|
+-------------------+-----+---------+--------+
|recordBnotDuplicate|hotel|54.783477|-1.57986|
| recordA|house| 54.78049|-1.57679|
| recordB|hotel| 52.02724|-2.16572|
| recordC|hotel| 52.51423|-1.97814|
| recordD|house| 51.46966|-0.45227|
| recordE|house| 50.91608|-1.45803|
| recordF|house| 52.59754|-1.07599|
+-------------------+-----+---------+--------+