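I have a DataFrame that contains a score for every offer for each contact. From it I want to create a new DataFrame that contains the top 3 offers for each contact.
The input DataFrame looks like this: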
=======================================================================
| contact | offer 1 | offer 2 | offer 3 | offer 4 | offer 5 | offer 6 |
=======================================================================
| name 1 | 0 | 3 | 1 | 2 | 1 | 6 |
-----------------------------------------------------------------------
| name 2 | 1 | 7 | 2 | 9 | 5 | 3 |
-----------------------------------------------------------------------
I want to transform it into a DataFrame like this:
===============================================================
| contact | best offer | second best offer | third best offer |
===============================================================
| name 1 | offer 6 | offer 2 | offer 4 |
---------------------------------------------------------------
| name 2 | offer 4 | offer 2 | offer 5 |
---------------------------------------------------------------
You will need a few imports:
from pyspark.sql.functions import array, col, lit, sort_array, struct
With the data from the question:
# sc is assumed to be an existing SparkContext (e.g. the one provided by the pyspark shell)
df = sc.parallelize([
("name 1", 0, 3, 1, 2, 1, 6),
("name 2", 1, 7, 2, 9, 5, 3),
]).toDF(["contact"] + ["offer_{}".format(i) for i in range(1, 7)])
You can assemble an array of structs and sort it in descending order:
offers = sort_array(array(*[
struct(col(c).alias("v"), lit(c).alias("k")) for c in df.columns[1:]
]), asc=False)
and then select the offer names from the first three elements:
df.select(
["contact"] + [offers[i]["k"].alias("_{}".format(i)) for i in [0, 1, 2]])
This should give the following result:
+-------+-------+-------+-------+
|contact| _0| _1| _2|
+-------+-------+-------+-------+
| name 1|offer_6|offer_2|offer_4|
| name 2|offer_4|offer_2|offer_5|
+-------+-------+-------+-------+
Rename the columns as needed and you are good to go.
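The sort works because each struct puts the score (v) first and the offer name (k) second. To my understanding Spark compares structs field by field, so the array is ordered by score, and equal scores are ordered by name. The plain-Python sketch below models that ordering with tuples. It is only an illustration, not Spark code, and top_offers is a hypothetical helper name.

```python
# Plain-Python model of sort_array(array(struct(v, k), ...), asc=False).
# Tuples, like structs, are compared field by field: score first, then name.
row = {"offer_1": 0, "offer_2": 3, "offer_3": 1,
       "offer_4": 2, "offer_5": 1, "offer_6": 6}


def top_offers(scores, n=3):
    """Return the names of the n highest-scoring offers (hypothetical helper)."""
    pairs = [(score, name) for name, score in scores.items()]
    # Descending by score; equal scores fall back to descending name order.
    pairs.sort(reverse=True)
    return [name for _, name in pairs[:n]]


print(top_offers(row))  # ['offer_6', 'offer_2', 'offer_4']
```

Note the tie-breaking rule this implies: in descending order, two offers with the same score come out by name in reverse order, so offer_5 is ranked ahead of offer_3 for name 1. If you need a different rule for ties, add another field to the struct between the score and the name.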