How to create multiple key-value pairs from a dataframe in pyspark



I have a dataframe with the following values:

customer_hash  count_beautyhygiene_l3_decile  net_paid_amount_l12_decile  unique_days_l12_decile
1234           1                              3                           1
5678           2                              3                           4
1257           3                              2                           2

I used the following code to get the key-value pairs for each customer_hash:

from pyspark.sql.functions import collect_list, struct

df = df.groupBy("customer_hash").agg(
    collect_list(
        struct(
            "count_beautyhygiene_l3_decile",
            "net_paid_amount_l12_decile",
            "unique_days_l12_decile",
        )
    ).alias("brandVariable")
)

This query gives the result below:

customer_hash      brandVariable
1234               [{"count_beautyhygiene_l3_decile": 1, "net_paid_amount_l12_decile": 3, "unique_days_l12_decile": 1}]
5678               [{"count_beautyhygiene_l3_decile": 2, "net_paid_amount_l12_decile": 3, "unique_days_l12_decile": 4}]
1257               [{"count_beautyhygiene_l3_decile": 3, "net_paid_amount_l12_decile": 2, "unique_days_l12_decile": 2}]

But my requirement is to generate output in the format below:

customer_hash      brandVariable
1234              [{
"NAME": "count_beautyhygiene_l3_decile",      
"VALUE": "1"
},
{
"NAME": "net_paid_amount_l12_decile",           
"VALUE": "3"
},
{
"NAME": "unique_days_l12_decile",
"VALUE": "1"
}]

5678              [{
"NAME": "count_beautyhygiene_l3_decile",
"VALUE": "2"
},
{
"NAME": "net_paid_amount_l12_decile", 
"VALUE": "3"
},
{
"NAME": "unique_days_l12_decile",
"VALUE": "4"
}] ... and so on

How can I achieve the desired output?
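To restate the requirement in plain Python terms, this is a reshape from one wide struct per customer to a list of NAME/VALUE entries, one per metric column (the sample row below is customer 1234 from the question):

```python
import json

# Sample wide row for one customer, using the column names from the question.
row = {
    "count_beautyhygiene_l3_decile": 1,
    "net_paid_amount_l12_decile": 3,
    "unique_days_l12_decile": 1,
}

# collect_list(struct(...)) yields a single struct holding all three columns:
obtained = [row]

# The requirement is one {"NAME": ..., "VALUE": ...} entry per column instead:
desired = [{"NAME": name, "VALUE": str(value)} for name, value in row.items()]

print(json.dumps(desired, indent=2))
```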

Try the following approach:

Input data:

data=[(1234,1,3,1),(5678,2,3,4),(1257,3,2,2)]
schema = ["customer_hash","count_beautyhygiene_l3_decile","net_paid_amount_l12_decile","unique_days_l12_decile"]
df = spark.createDataFrame(data=data,schema=schema)
df.show()
+-------------+-----------------------------+--------------------------+----------------------+
|customer_hash|count_beautyhygiene_l3_decile|net_paid_amount_l12_decile|unique_days_l12_decile|
+-------------+-----------------------------+--------------------------+----------------------+
|         1234|                            1|                         3|                     1|
|         5678|                            2|                         3|                     4|
|         1257|                            3|                         2|                     2|
+-------------+-----------------------------+--------------------------+----------------------+

Required output:

from pyspark.sql.functions import to_json, from_json, struct, explode, collect_list
from pyspark.sql.types import MapType, StringType, IntegerType

(df.select("customer_hash", to_json(struct("count_beautyhygiene_l3_decile", "net_paid_amount_l12_decile", "unique_days_l12_decile")).alias("temp"))
 .select("customer_hash", from_json("temp", MapType(StringType(), IntegerType())).alias("entries"))
 .select("customer_hash", explode("entries").alias("NAME", "VALUE"))
 .select("customer_hash", to_json(struct("NAME", "VALUE")).alias("temp2"))
 .groupBy("customer_hash").agg(collect_list("temp2").alias("brandVariable"))
).show(truncate=False)
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_hash|brandVariable                                                                                                                                        |
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|1234         |[{"NAME":"count_beautyhygiene_l3_decile","VALUE":1}, {"NAME":"net_paid_amount_l12_decile","VALUE":3}, {"NAME":"unique_days_l12_decile","VALUE":1}]   |
|5678         |[{"NAME":"count_beautyhygiene_l3_decile","VALUE":2}, {"NAME":"net_paid_amount_l12_decile","VALUE":3}, {"NAME":"unique_days_l12_decile","VALUE":4}]   |
|1257         |[{"NAME":"count_beautyhygiene_l3_decile","VALUE":3}, {"NAME":"net_paid_amount_l12_decile","VALUE":2}, {"NAME":"unique_days_l12_decile","VALUE":2}]   |
+-------------+-----------------------------------------------------------------------------------------------------------------------------------------------------+
