Pyspark - Insert List to Dataframe Cell



I have a dictionary, and I want to add the list of the dictionary's keys to every cell of a column in a DataFrame. My attempts so far have not worked, and I don't know why.

The dictionary looks like this:

my_dict = {"A":"1","B":"2","C":"3","D":"4"}

I want to add the keys of my_dict to a DataFrame column, so the final result looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|         XX |[A, B, C, D]|
+------------+------------+------------+

The goal is for the next step to be an explode, so that the DataFrame looks like this:

+------------+------------+------------+
|       comb1|       comb2|        colA|
+------------+------------+------------+
|          YY|         XX |           A|
|          YY|         XX |           B|
|          YY|         XX |           C|
|          YY|         XX |           D|
+------------+------------+------------+

How can I insert the dictionary keys into every row of the column and then explode it?

You can create some temporary constant columns from the dictionary's keys, then collect them into an array, and finally explode that column.

The code is simpler than the explanation:

from pyspark.sql import functions as F

# create temporary constant columns with the keys of the dictionary
for k in my_dict.keys():
    df = df.withColumn(f'_temp_{k}', F.lit(k))

df = (
    df
    # add a column with an array collecting all the keys
    .withColumn('colA', F.array(*[f'_temp_{k}' for k in my_dict.keys()]))
    # drop the temporary columns
    .drop(*[f'_temp_{k}' for k in my_dict.keys()])
    # explode the column with the array
    .withColumn('colA', F.explode(F.col('colA')))
)

The resulting df is:

+-----+-----+----+
|comb1|comb2|colA|
+-----+-----+----+
|   YY|   XX|   A|
|   YY|   XX|   B|
|   YY|   XX|   C|
|   YY|   XX|   D|
+-----+-----+----+
