我有一个数据帧,其值低于
customer_hash count_beautyhygiene_l3_decile net_paid_amount_l12_decile unique_days_l12_decile
1234 1 3 1
5678 2 3 4
1257 3 2 2
我使用以下代码为每个客户实现键值_ hash
df = df.groupBy("customer_hash").agg(collect_list(struct("count_beautyhygiene_l3_decile","net_paid_amount_l12_decile","unique_days_l12_decile")).alias('brandVariable'))
从上面的查询中,得到下面的结果
customer_hash brandVariable
1234 [{"count_beautyhygiene_l3_decile": 1,
"net_paid_amount_l12_decile": 3,
"unique_days_l12_decile": 1}]
5678 [{"count_beautyhygiene_l3_decile": 2,
"net_paid_amount_l12_decile": 3,
"unique_days_l12_decile": 4}]
1257 [{"count_beautyhygiene_l3_decile": 3,
"net_paid_amount_l12_decile": 2,
"unique_days_l12_decile": 2}]
但我的要求是生成如下格式的输出
customer_hash brandVariable
1234 [{
"NAME": "count_beautyhygiene_l3_decile",
"VALUE": "1"
},
{
"NAME": "net_paid_amount_l12_decile",
"VALUE": "3"
},
{
"NAME": "unique_days_l12_decile",
"VALUE": "1"
}]
5678 [{
"NAME": "count_beautyhygiene_l3_decile",
"VALUE": "2"
},
{
"NAME": "net_paid_amount_l12_decile",
"VALUE": "3"
},
{
"NAME": "unique_days_l12_decile",
"VALUE": "4"
}]...so on
如何实现所需的输出?
尝试使用以下方法-
输入数据
data=[(1234,1,3,1),(5678,2,3,4),(1257,3,2,2)]
schema = ["customer_hash","count_beautyhygiene_l3_decile","net_paid_amount_l12_decile","unique_days_l12_decile"]
df = spark.createDataFrame(data=data,schema=schema)
df.show()
+-------------+-----------------------------+--------------------------+----------------------+
|customer_hash|count_beautyhygiene_l3_decile|net_paid_amount_l12_decile|unique_days_l12_decile|
+-------------+-----------------------------+--------------------------+----------------------+
| 1234| 1| 3| 1|
| 5678| 2| 3| 4|
| 1257| 3| 2| 2|
+-------------+-----------------------------+--------------------------+----------------------+
所需输出:
from pyspark.sql.functions import *
from pyspark.sql.types import *
(df.select("customer_hash", to_json(struct("count_beautyhygiene_l3_decile", "net_paid_amount_l12_decile", "unique_days_l12_decile")).alias("temp"))
.select("customer_hash", from_json("temp", MapType(StringType(), IntegerType())))
.select("customer_hash", explode("entries").alias("NAME", "VALUES"))
.select("customer_hash", to_json(struct("NAME", "VALUES")).alias("temp2"))
.groupBy("customer_hash").agg(collect_list("temp2").alias("brandVariable"))
).show(truncate=False)
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_hash|brandVariable |
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|1234 |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":1}, {"NAME":"net_paid_amount_l12_decile","VALUES":3}, {"NAME":"unique_days_l12_decile","VALUES":1}]|
|5678 |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":2}, {"NAME":"net_paid_amount_l12_decile","VALUES":3}, {"NAME":"unique_days_l12_decile","VALUES":4}]|
|1257 |[{"NAME":"count_beautyhygiene_l3_decile","VALUES":3}, {"NAME":"net_paid_amount_l12_decile","VALUES":2}, {"NAME":"unique_days_l12_decile","VALUES":2}]|
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+