Pyspark - Insert List to Dataframe Cell



I have a dictionary, and I want to add the list of the dictionary's keys to every cell of a column in a DataFrame. My attempts so far have not worked, and I don't know why.

The dictionary looks like this:

my_dict = {"A":"1","B":"2","C":"3","D":"4"}

I want to add the keys of my_dict to a DataFrame column, so the final result looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|         XX |[A, B, C, D]|
+------------+------------+------------+

The goal is for the next step to be an explode, so that the DataFrame looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|         XX |           A|
|          YY|         XX |           B|
|          YY|         XX |           C|
|          YY|         XX |           D|
+------------+------------+------------+

How can I insert the dictionary keys into every row of the column and then explode it?

You can create some temporary constant columns from the dictionary's keys, then collect them into an array, and finally explode that column.

The code is simpler than the explanation:

from pyspark.sql import functions as F

# create temporary constant columns with the keys of the dictionary
for k in my_dict.keys():
    df = df.withColumn(f'_temp_{k}', F.lit(k))

df = (
    df
    # add a column with an array collecting all the keys
    .withColumn('colA', F.array(*[f'_temp_{k}' for k in my_dict.keys()]))
    # drop the temporary columns
    .drop(*[f'_temp_{k}' for k in my_dict.keys()])
    # explode the column with the array
    .withColumn('colA', F.explode(F.col('colA')))
)

The resulting df is:

+-----+-----+----+
|comb1|comb2|colA|
+-----+-----+----+
|   YY|   XX|   A|
|   YY|   XX|   B|
|   YY|   XX|   C|
|   YY|   XX|   D|
+-----+-----+----+
