I am trying to convert a SQL string into the Spark DataFrame API form. However, when I use the DataFrame API, I get different results. Can someone tell me what I am doing wrong? Specifically, I want to know why I get different records; I don't mind a different layout.
SQL code
spark.sql("""SELECT delay, origin, destination,
CASE
WHEN delay > 360 THEN 'Very Long Delays'
WHEN delay > 120 AND delay < 360 THEN 'Long Delays '
WHEN delay > 60 AND delay < 120 THEN 'Short Delays'
WHEN delay > 0 AND delay < 60 THEN 'Tolerable Delays'
WHEN delay = 0 THEN 'No Delays'
ELSE 'No Delays'
END AS Flight_Delays
FROM us_delay_flights_tbl
ORDER BY origin, delay DESC""").show(10, truncate=False)
Spark DataFrame API code
from pyspark.sql import functions as F
from pyspark.sql.functions import col

df.select("delay", "origin", "destination",
F.when(col("delay") > 360, "Very Long Delays")
.when((col("delay") > 120) & (col("delay") < 360), "Long Delays")
.when((col("delay") > 60) & (col("delay") < 120), "Short Delays")
.when((col("delay") > 0) & (col("delay") < 60), "Tolerable Delays")
.when(col("delay") == 0, "No Delays").otherwise("No Delays"))
.orderBy("origin", "delay", ascending=False).show(10)
The result I get from the SQL string (this is what I want)
+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|333 |ABE |ATL |Long Delays |
|305 |ABE |ATL |Long Delays |
|275 |ABE |ATL |Long Delays |
|257 |ABE |ATL |Long Delays |
|247 |ABE |ATL |Long Delays |
|247 |ABE |DTW |Long Delays |
|219 |ABE |ORD |Long Delays |
|211 |ABE |ATL |Long Delays |
|197 |ABE |DTW |Long Delays |
|192 |ABE |ORD |Long Delays |
+-----+------+-----------+-------------+
only showing top 10 rows
The result I get from the DataFrame API (this is not what I want)
| 475| YUM| PHX|Very Long Delays|
| 347| YUM| LAX|Long Delays|
| 333| YUM| LAX|Long Delays|
| 285| YUM| PHX|Long Delays|
| 267| YUM| LAX|Long Delays|
| 236| YUM| LAX|Long Delays|
| 231| YUM| LAX|Long Delays|
I see you have mentioned ORDER BY origin, delay DESC in your SQL query. This means: sort by origin in ascending order, and by delay in descending order. In the DataFrame API version, you set the ascending flag to False, which sorts both origin and delay in descending order. If you want the DataFrame API to produce the same result as the SQL query, you can modify it as follows:

orderBy(["origin", "delay"], ascending=[1, 0])

You can refer to this link for a better idea: http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=take#pyspark.sql.DataFrame.orderBy
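To see why the mixed ascending/descending ordering matters, here is a small plain-Python sketch (no Spark required) of the same ORDER BY origin ASC, delay DESC logic. The rows are made-up illustrative values, not taken from the actual table:

```python
# Sample (delay, origin, destination) rows; values are illustrative only.
rows = [
    (305, "ABE", "ATL"),
    (475, "YUM", "PHX"),
    (333, "ABE", "ATL"),
    (347, "YUM", "LAX"),
]

# ORDER BY origin ASC, delay DESC:
# sort ascending on origin, and negate the numeric delay
# so that larger delays come first within each origin.
ordered = sorted(rows, key=lambda r: (r[1], -r[0]))

for delay, origin, dest in ordered:
    print(delay, origin, dest)
# 333 ABE ATL
# 305 ABE ATL
# 475 YUM PHX
# 347 YUM LAX
```

This mirrors what ascending=[1, 0] does in orderBy: one direction per sort key, rather than a single flag applied to every key.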