Spark DataFrame API does not produce the same results as the Spark SQL version



I am trying to convert a SQL string into the Spark DataFrame API format. However, when I use the DataFrame API I get different results. Can someone tell me what I am doing wrong? Specifically, I want to know why I get different records; I don't mind the different layout.

SQL code

spark.sql("""SELECT delay, origin, destination,
    CASE
        WHEN delay > 360 THEN 'Very Long Delays'
        WHEN delay > 120 AND delay < 360 THEN 'Long Delays'
        WHEN delay > 60 AND delay < 120 THEN 'Short Delays'
        WHEN delay > 0 AND delay < 60 THEN 'Tolerable Delays'
        WHEN delay = 0 THEN 'No Delays'
        ELSE 'No Delays'
    END AS Flight_Delays
    FROM us_delay_flights_tbl
    ORDER BY origin, delay DESC""").show(10, truncate=False)

Spark DataFrame API code

df.select("delay", "origin", "destination", 
F.when(col("delay") > 360, "Very Long Delays")
.when((col("delay") > 120) & (col("delay") < 360), "Long Delays")
.when((col("delay") > 60) & (col("delay") < 120), "Short Delays")
.when((col("delay") > 0) & (col("delay") < 60), "Tolerable Delays")
.when(col("delay") == 0, "No Delays").otherwise("No Delays"))
.orderBy("origin", "delay", ascending=False).show(10)

The result I get from the SQL string (what I want)

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|333  |ABE   |ATL        |Long Delays  |
|305  |ABE   |ATL        |Long Delays  |
|275  |ABE   |ATL        |Long Delays  |
|257  |ABE   |ATL        |Long Delays  |
|247  |ABE   |ATL        |Long Delays  |
|247  |ABE   |DTW        |Long Delays  |
|219  |ABE   |ORD        |Long Delays  |
|211  |ABE   |ATL        |Long Delays  |
|197  |ABE   |DTW        |Long Delays  |
|192  |ABE   |ORD        |Long Delays  |
+-----+------+-----------+-------------+
only showing top 10 rows

The result I get from the DataFrame API (which is not what I want)

|  475|   YUM|        PHX|Very Long Delays|
|  347|   YUM|        LAX|Long Delays|
|  333|   YUM|        LAX|Long Delays|
|  285|   YUM|        PHX|Long Delays|
|  267|   YUM|        LAX|Long Delays|
|  236|   YUM|        LAX|Long Delays|
|  231|   YUM|        LAX|Long Delays|

I see that your SQL query uses ORDER BY origin, delay DESC, which means: sort by origin in ascending order, then by delay in descending order. In the DataFrame API version, however, you set the ascending flag to False, which sorts both the origin column and the delay column in descending order. If you want the DataFrame API to produce the same result as the SQL query, you can modify it as follows:

orderBy(["origin", "delay"], ascending=[1, 0]) 
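To see what the per-column ascending flags mean, here is a plain-Python sketch of the same ordering on a few made-up rows (the data below is hypothetical, not taken from the actual flights table):

```python
# Hypothetical sample rows: (delay, origin, destination).
rows = [
    (347, "YUM", "LAX"),
    (333, "ABE", "ATL"),
    (475, "YUM", "PHX"),
    (305, "ABE", "ATL"),
]

# ascending=[1, 0] means: origin ascending, then delay descending.
# In plain Python the equivalent sort key is (origin, -delay).
ordered = sorted(rows, key=lambda r: (r[1], -r[0]))

for delay, origin, destination in ordered:
    print(delay, origin, destination)
# ABE rows come first (delay descending within ABE), then YUM rows.
```

Equivalently, in PySpark you can express the direction per column with `df.orderBy(F.asc("origin"), F.desc("delay"))`, which avoids the ascending list entirely.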

You can refer to this link for more details: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=take#pyspark.sql.DataFrame.orderBy
