I have the following dataset in PySpark:
+-----------+-----------+
|weekend_day|totals |
+-----------+-----------+
| 2023-02-25| 401943676|
| 2023-03-11| 410220150|
+-----------+-----------+
The expected output is:
+--------+------------+------------+
|        | 2023-02-25 | 2023-03-11 |
+--------+------------+------------+
| totals |  401943676 |  410220150 |
+--------+------------+------------+
pivot did not give me this result. Any suggestions on how to achieve it?
Please note that I do not want to use Pandas.
Thanks.
Not sure what you mean by "pivot did not give me this result"?
df = spark.createDataFrame(
[('2023-02-25', 401943676), ('2023-03-11', 410220150)],
schema=['weekend_day', 'totals']
)
df.printSchema()
df.show(3, False)
+-----------+---------+
|weekend_day|totals |
+-----------+---------+
|2023-02-25 |401943676|
|2023-03-11 |410220150|
+-----------+---------+
You can use groupBy and pivot to achieve the expected output:

from pyspark.sql import functions as func
df.groupBy(
    # A constant literal label so all rows collapse into one output row
    func.lit('total').alias('col_name')
).pivot(
    # Each distinct weekend_day becomes its own column
    'weekend_day'
).agg(
    # Exactly one value per (row, day) pair, so first() just picks it
    func.first('totals')
).show(
    10, False
)
+--------+----------+----------+
|col_name|2023-02-25|2023-03-11|
+--------+----------+----------+
|total |401943676 |410220150 |
+--------+----------+----------+
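As a side note, if you already know which weekend_day values you want as columns, you can pass them to pivot explicitly; Spark then skips the extra job it would otherwise run to discover the distinct values. A minimal sketch, with the value list hardcoded from your sample data:

df.groupBy(
    func.lit('total').alias('col_name')
).pivot(
    # Explicit pivot values: no extra scan to find the distinct days
    'weekend_day', ['2023-02-25', '2023-03-11']
).agg(
    func.first('totals')
).show(10, False)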
Another way to achieve the same result:
from pyspark.sql.functions import col, lit, sum, when

weekend_days = df.select("weekend_day").distinct().rdd.flatMap(lambda x: x).collect()
transformed_df = df.groupBy(lit("totals").alias("columnName")).agg(
    # One conditional sum per distinct day; non-matching rows yield NULL, which sum() ignores
    *[sum(when(col("weekend_day") == day, col("totals"))).alias(day) for day in weekend_days]
)
transformed_df.show()
+----------+----------+----------+
|columnName|2023-02-25|2023-03-11|
+----------+----------+----------+
| totals| 401943676| 410220150|
+----------+----------+----------+
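If you prefer to stay on the DataFrame API rather than dropping to the RDD, the same list of distinct values can be collected like this (a minimal sketch; sorted() is added only to make the column order deterministic, which distinct() does not guarantee):

# Equivalent to the rdd.flatMap(...).collect() line above
weekend_days = sorted(
    row['weekend_day']
    for row in df.select('weekend_day').distinct().collect()
)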