Apache Spark SQL: using a DataFrame as reference data in a query



I have two Spark DataFrames:

cities, a DataFrame with the following column:

city
-----
London
Austin

bigCities, a DataFrame with the following column:

name
------
London
Cairo

I need to transform the cities DataFrame by adding an extra boolean column, bigCity. Its value must be computed from the condition "cities.city IN bigCities.name".

I can do this with a static collection of big cities, like so:

cities.createOrReplaceTempView("cities")
var resultDf = spark.sql("SELECT city, CASE WHEN city IN ('London', 'Cairo') THEN 'Y' ELSE 'N' END AS bigCity FROM cities")

But I don't know how to replace the static list ('London', 'Cairo') in the query with the bigCities DataFrame. I want to use bigCities as reference data inside the query.

Please advise how to achieve this.

You can do this with a left outer join against bigCities:

import org.apache.spark.sql.functions.when
import spark.implicits._   // provides the $ column syntax

val df = cities.join(bigCities, $"name".equalTo($"city"), "leftouter").
                withColumn("bigCity", when($"name".isNull, "N").otherwise("Y")).
                drop("name")

You can use collect_list() on the bigCities table. Check this out:

scala> val df_city = Seq(("London"),("Austin")).toDF("city")
df_city: org.apache.spark.sql.DataFrame = [city: string]
scala> val df_bigCities = Seq(("London"),("Cairo")).toDF("name")
df_bigCities: org.apache.spark.sql.DataFrame = [name: string]
scala> df_city.createOrReplaceTempView("cities")
scala> df_bigCities.createOrReplaceTempView("bigCities")
scala> spark.sql(" select city, case when array_contains((select collect_list(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city  |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+

scala>

If the dataset is large, you can use collect_set instead, which is more efficient because it de-duplicates the names:

scala> spark.sql(" select city, case when array_contains((select collect_set(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city  |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+

scala>
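
For reference, the same check can also be written with the DataFrame API instead of SQL, by collecting the big-city names to the driver and using Column.isin. This is only a sketch and assumes bigCities is small enough to collect:

import org.apache.spark.sql.functions.when
import spark.implicits._

// Collect the reference names once, then test membership per row.
val bigCityNames = df_bigCities.as[String].collect()          // Array("London", "Cairo")
val result = df_city.withColumn("bigCity",
  when($"city".isin(bigCityNames: _*), "Y").otherwise("N"))
result.show(false)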
