PySpark - Transpose



I have the following dataset in PySpark:

+-----------+-----------+                                                       
|weekend_day|totals     |
+-----------+-----------+
| 2023-02-25|  401943676|
| 2023-03-11|  410220150|
+-----------+-----------+

and the expected output is:

+--------+------------+------------+
|        | 2023-02-25 | 2023-03-11 |
| totals | 401943676  | 410220150  |
+--------+------------+------------+

pivot does not give me this result. Can you suggest how to achieve it?

Please note that I don't want to use Pandas.

Thanks

Not sure what you mean by pivot not giving the result?

df = spark.createDataFrame(
    [('2023-02-25', 401943676), ('2023-03-11', 410220150)],
    schema=['weekend_day', 'totals']
)
df.printSchema()
df.show(3, False)
+-----------+---------+
|weekend_day|totals   |
+-----------+---------+
|2023-02-25 |401943676|
|2023-03-11 |410220150|
+-----------+---------+

You can use groupBy with pivot to achieve the expected output:

from pyspark.sql import functions as func

df.groupBy(
    func.lit('total').alias('col_name')
).pivot(
    'weekend_day'
).agg(
    func.first('totals')
).show(10, False)
+--------+----------+----------+
|col_name|2023-02-25|2023-03-11|
+--------+----------+----------+
|total   |401943676 |410220150 |
+--------+----------+----------+
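
If the weekend_day values are known in advance, you can also pass them to pivot explicitly so Spark does not need an extra pass to compute the distinct values first. A minimal sketch, assuming the same df and func alias as above:

# Sketch: listing the pivot values explicitly avoids the extra distinct job.
df.groupBy(
    func.lit('total').alias('col_name')
).pivot(
    'weekend_day', ['2023-02-25', '2023-03-11']
).agg(
    func.first('totals')
).show(10, False)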

Another way to achieve the same result is:

from pyspark.sql.functions import col, lit, sum, when

weekend_days = df.select("weekend_day").distinct().rdd.flatMap(lambda x: x).collect()
transformed_df = df.groupBy(lit("totals").alias("columnName")).agg(
    *[sum(when(col("weekend_day") == day, col("totals"))).alias(day) for day in weekend_days])
transformed_df.show()
+----------+----------+----------+
|columnName|2023-02-25|2023-03-11|
+----------+----------+----------+
|    totals| 401943676| 410220150|
+----------+----------+----------+
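
If the real table has more than one metric column to transpose (the sample only has totals), one option is to unpivot with stack first and then pivot back on weekend_day. This is only a sketch: the counts column below is hypothetical, added purely to illustrate the pattern.

# Sketch only: 'counts' is a hypothetical second metric, cast to long so its
# type matches 'totals' inside stack.
from pyspark.sql import functions as F

wide = df.withColumn('counts', F.lit(0).cast('long'))
long_df = wide.selectExpr(
    "weekend_day",
    "stack(2, 'totals', totals, 'counts', counts) as (col_name, value)"
)
long_df.groupBy('col_name').pivot('weekend_day').agg(F.first('value')).show()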
