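I have a dictionary, and I want to add the list of its keys to every cell of a column in a DataFrame. My attempts so far have not worked, and I don't know why.
The dictionary looks like this: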
my_dict = {"A":"1","B":"2","C":"3","D":"4"}
I want to add the keys of my_dict to a DataFrame column, so that the end result looks like this:
+------------+------------+------------+
| comb1| comb2| colA|
+------------+------------+------------+
| YY| XX |[A, B, C, D]|
+------------+------------+------------+
The goal is to explode that column in the next step, so that the DataFrame looks like this:
+------------+------------+------------+
| comb1| comb2| colA|
+------------+------------+------------+
| YY| XX | A|
+------------+------------+------------+
| YY| XX | B|
+------------+------------+------------+
| YY| XX | C|
+------------+------------+------------+
| YY| XX | D|
+------------+------------+------------+
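To make the two target shapes concrete, here is a minimal plain-Python sketch of the same transformation, with no Spark involved. It assumes the rows are modelled as dictionaries; the names `rows`, `with_array`, and `exploded` are illustrative only and are not part of my actual code.

```python
my_dict = {"A": "1", "B": "2", "C": "3", "D": "4"}

# A stand-in for the DataFrame: one dict per row (illustrative assumption).
rows = [{"comb1": "YY", "comb2": "XX"}]

# Dicts preserve insertion order (Python 3.7+), so this is A, B, C, D.
keys = list(my_dict)

# Step 1: attach the full list of keys to every row.
with_array = [{**row, "colA": keys} for row in rows]

# Step 2: "explode" the list, producing one output row per key.
exploded = [{**row, "colA": k} for row in with_array for k in row["colA"]]

for row in exploded:
    print(row)
```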
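How can I insert the dictionary keys into every row of the column and then explode it?
You can create a temporary constant column for each key of the dictionary, build an array column from those columns, and finally explode that array column.
The code is simpler than the explanation: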
from pyspark.sql import functions as F

# create temporary constant columns with the keys of the dictionary
for k in my_dict.keys():
    df = df.withColumn(f'_temp_{k}', F.lit(k))

df = (
    df
    # add a column with an array collecting all the keys
    .withColumn('colA', F.array(*[f'_temp_{k}' for k in my_dict.keys()]))
    # drop the temporary columns
    .drop(*[f'_temp_{k}' for k in my_dict.keys()])
    # explode the column with the array
    .withColumn('colA', F.explode(F.col('colA')))
)
The resulting df is:
+-----+-----+----+
|comb1|comb2|colA|
+-----+-----+----+
| YY| XX| A|
| YY| XX| B|
| YY| XX| C|
| YY| XX| D|
+-----+-----+----+
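The temporary columns are not strictly required. As a minimal sketch, assuming `df` already holds the `comb1` and `comb2` columns, the literal keys can be passed straight into `F.array` and exploded in one step. The helper name `explode_dict_keys` and its `col_name` parameter are hypothetical, introduced here only for illustration.

```python
my_dict = {"A": "1", "B": "2", "C": "3", "D": "4"}
keys = list(my_dict)  # dicts keep insertion order (Python 3.7+)


def explode_dict_keys(df, keys, col_name="colA"):
    """Return df with one output row per (input row, key) pair."""
    # Imported lazily so that defining the helper does not require Spark.
    from pyspark.sql import functions as F

    # Build an array column of literal keys, then explode it into rows.
    key_array = F.array(*[F.lit(k) for k in keys])
    return df.withColumn(col_name, F.explode(key_array))


# Usage, assuming an existing DataFrame `df`:
# df = explode_dict_keys(df, keys)
```

This skips the `withColumn` loop and the `drop` call, so the query plan does not carry one extra column per key.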