Spark SQL UNION - ORDER BY column not in SELECT



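I'm doing a UNION of two temp tables and trying to order by a column, but Spark complains that the column I'm ordering by cannot be resolved. Is this a bug, or am I missing something?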

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructType}

lazy val spark: SparkSession = SparkSession.builder.master("local[*]").getOrCreate()

val oldOrders = Seq(
  Seq("old_order_id1", "old_order_name1", "true"),
  Seq("old_order_id2", "old_order_name2", "true")
)
val newOrders = Seq(
  Seq("new_order_id1", "new_order_name1", "false"),
  Seq("new_order_id2", "new_order_name2", "false")
)
// Both tables share the same three-column, all-string schema
val schema = new StructType()
  .add("id", StringType)
  .add("name", StringType)
  .add("is_old", StringType)
val oldOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(oldOrders.map(x => Row(x: _*))), schema)
val newOrdersDF = spark.createDataFrame(spark.sparkContext.makeRDD(newOrders.map(x => Row(x: _*))), schema)
oldOrdersDF.createOrReplaceTempView("old_orders")
newOrdersDF.createOrReplaceTempView("new_orders")
// ordering by a column not in the select list works if I'm not doing a UNION
spark.sql(
  """
    |SELECT oo.id, oo.name FROM old_orders oo
    |ORDER BY oo.is_old
  """.stripMargin).show()
// ordering by a column not in the select list doesn't work once I'm doing a UNION
spark.sql(
  """
    |SELECT oo.id, oo.name FROM old_orders oo
    |UNION
    |SELECT no.id, no.name FROM new_orders no
    |ORDER BY oo.is_old
  """.stripMargin).show()
The output of the above code is:
    +-------------+---------------+
    |           id|           name|
    +-------------+---------------+
    |old_order_id1|old_order_name1|
    |old_order_id2|old_order_name2|
    +-------------+---------------+
    
    
    cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
    'Sort ['oo.is_old ASC NULLS FIRST], true
    +- Distinct
       +- Union
          :- Project [id#121, name#122]
          :  +- SubqueryAlias oo
          :     +- SubqueryAlias old_orders
          :        +- LogicalRDD [id#121, name#122, is_old#123]
          +- Project [id#131, name#132]
             +- SubqueryAlias no
                +- SubqueryAlias new_orders
                   +- LogicalRDD [id#131, name#132, is_old#133]
    
    org.apache.spark.sql.AnalysisException: cannot resolve '`oo.is_old`' given input columns: [id, name]; line 5 pos 9;
    'Sort ['oo.is_old ASC NULLS FIRST], true
    +- Distinct
       +- Union
          :- Project [id#121, name#122]
          :  +- SubqueryAlias oo
          :     +- SubqueryAlias old_orders
          :        +- LogicalRDD [id#121, name#122, is_old#123]
          +- Project [id#131, name#132]
             +- SubqueryAlias no
                +- SubqueryAlias new_orders
                   +- LogicalRDD [id#131, name#132, is_old#133]

So ordering by a column that is not in the SELECT clause works when I don't do a UNION, but fails when I UNION the two tables.

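Spark SQL's syntax is very similar to standard SQL, but the two work quite differently. Under the hood, Spark turns everything into operations on RDDs/DataFrames.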

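The UNION produces a new DataFrame, and the ORDER BY is applied to that result rather than to either input table. Any field of the original tables/DataFrames that was not selected is not part of that result, so it can no longer be referenced. This matches the error message, which lists only [id, name] as the available input columns.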
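To make this concrete, here is a minimal sketch that builds the two-column UNION from the question and prints the columns of the result. The object name `UnionColumnsSketch`, the helper `unioned`, and the single-row stand-in data are made up for illustration; only the table and column names come from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original question.
object UnionColumnsSketch {
  // Registers tiny stand-in tables and returns the two-column UNION.
  def unioned(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("old_order_id1", "old_order_name1", "true"))
      .toDF("id", "name", "is_old")
      .createOrReplaceTempView("old_orders")
    Seq(("new_order_id1", "new_order_name1", "false"))
      .toDF("id", "name", "is_old")
      .createOrReplaceTempView("new_orders")
    spark.sql(
      """
        |SELECT oo.id, oo.name FROM old_orders oo
        |UNION
        |SELECT no.id, no.name FROM new_orders no
      """.stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    // Only the selected columns are part of the UNION result,
    // so is_old does not appear in this list.
    println(unioned(spark).columns.mkString(", "))
    spark.stop()
  }
}
```

Since `is_old` is not among the columns of the result, a sort placed on top of the UNION has nothing to resolve it against.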

How to fix it:

spark.sql(
      """
        |SELECT id, name
        |FROM (
        | SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
        | UNION
        | SELECT no.id, no.name, no.is_old FROM new_orders no
        | ORDER BY oo.is_old
        | ) t
      """.stripMargin).show()

Thanks.
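A possible variation, offered as a sketch rather than as the definitive fix: carry `is_old` through the UNION inside the subquery, and put the ORDER BY on the outer query, referencing the subquery alias `t`. This relies on the same behavior the question's first query shows, where Spark can sort by a column of the underlying relation that is not in the select list. Keep in mind that UNION removes duplicates across all selected columns, so adding `is_old` means rows are deduplicated on (id, name, is_old) instead of (id, name). The object name `OrderOutsideUnionSketch`, the helper `ordered`, and the single-row stand-in data are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, not part of the original answer.
object OrderOutsideUnionSketch {
  // Registers tiny stand-in tables and returns id/name sorted by is_old.
  def ordered(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(("old_order_id1", "old_order_name1", "true"))
      .toDF("id", "name", "is_old")
      .createOrReplaceTempView("old_orders")
    Seq(("new_order_id1", "new_order_name1", "false"))
      .toDF("id", "name", "is_old")
      .createOrReplaceTempView("new_orders")
    spark.sql(
      """
        |SELECT t.id, t.name
        |FROM (
        |  SELECT oo.id, oo.name, oo.is_old FROM old_orders oo
        |  UNION
        |  SELECT no.id, no.name, no.is_old FROM new_orders no
        |) t
        |ORDER BY t.is_old
      """.stripMargin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").getOrCreate()
    ordered(spark).show()
    spark.stop()
  }
}
```

With these stand-in rows, the string "false" sorts before "true", so the new order is listed before the old one, and the output still contains only `id` and `name`.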
