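I am doing a UNION of two temp tables and trying to order by a column, but Spark complains that the column I am ordering by cannot be resolved. Is this a bug, or am I missing something?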
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}
lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()
val oldOrders = Seq(
  Seq("old_order_id1", "old_order_name1", "true"),
  Seq("old_order_id2", "old_order_name2", "true")
)
val newOrders = Seq(
  Seq("new_order_id1", "new_order_name1", "false"),
  Seq("new_order_id2", "new_order_name2", "false")
)
val schema = new StructType()
  .add("id", StringType)
  .add("name", StringType)
  .add("is_old", StringType)
val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x:_*))), schema)
val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x:_*))), schema)
oldOrdersDF.createOrReplaceTempView("old_orders")
newOrdersDF.createOrReplaceTempView("new_orders")
// ordering by a column not in the select works if I'm not doing a UNION
spark.sql(
  """
    |SELECT oo.id, oo.name FROM old_orders oo
    |ORDER BY oo.is_old
  """.stripMargin).show()
// ordering by a column not in the select doesn't work when I'm doing a UNION
spark.sql(
  """
    |SELECT oo.id, oo.name FROM old_orders oo
    |UNION
    |SELECT no.id, no.name FROM new_orders no
    |ORDER BY oo.is_old
  """.stripMargin).show()
The output of the above code is:
+-------------+---------------+
|           id|           name|
+-------------+---------------+
|old_order_id1|old_order_name1|
|old_order_id2|old_order_name2|
+-------------+---------------+
org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
'Sort ['oo.is_old ASC NULLS FIRST], true
+- Distinct
   +- Union
      :- Project [id#121, name#122]
      :  +- SubqueryAlias oo
      :     +- SubqueryAlias old_orders
      :        +- LogicalRDD [id#121, name#122, is_old#123]
      +- Project [id#131, name#132]
         +- SubqueryAlias no
            +- SubqueryAlias new_orders
               +- LogicalRDD [id#131, name#132, is_old#133]
So ordering by a column that is not in the SELECT clause works when I don't do a UNION, and fails when I UNION the two tables.
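Spark SQL's syntax closely resembles standard SQL, but the two work quite differently. Under the hood, Spark is all about RDDs/DataFrames.
After the UNION, a new DataFrame is produced, and columns of the original tables/DataFrames that were not selected can no longer be referenced. The plan in the error message shows this: the Sort sits on top of Distinct/Union, whose only output columns are id and name, so there is no is_old left to resolve.
How to fix it: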
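One way to check this yourself is to look at the columns of the unioned result. The following is a minimal sketch, not the asker's exact setup: it rebuilds the two temp views with `toDF` for brevity (an assumption made only to keep the snippet self-contained) and prints the column names that survive the UNION.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").getOrCreate()
import spark.implicits._

// Same data shape as in the question, built with toDF for brevity (assumption).
Seq(("old_order_id1", "old_order_name1", "true"),
    ("old_order_id2", "old_order_name2", "true"))
  .toDF("id", "name", "is_old").createOrReplaceTempView("old_orders")
Seq(("new_order_id1", "new_order_name1", "false"),
    ("new_order_id2", "new_order_name2", "false"))
  .toDF("id", "name", "is_old").createOrReplaceTempView("new_orders")

// The UNION from the question, without the ORDER BY.
val unioned = spark.sql(
  """
    |SELECT oo.id, oo.name FROM old_orders oo
    |UNION
    |SELECT no.id, no.name FROM new_orders no
  """.stripMargin)

// Only the projected columns are part of the UNION's output,
// so a following ORDER BY has no is_old to bind to.
val unionColumns: Seq[String] = unioned.columns.toSeq
println(unionColumns.mkString(", ")) // prints: id, name
```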
spark.sql(
"""
|SELECT id, name
|FROM (
| SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
| UNION
| SELECT no.id, no.name, no.is_old FROM new_orders no
| ORDER BY oo.is_old
| ) t
""".stripMargin).show()
Thanks.