I want to build a table of list-element frequencies from an array column in a DataFrame.
Example:
As-is:
+----+---------------+
| id | data |
+----+---------------+
|a |[1,2,3,4,5] |
|b |[2,2,4,5] |
|c |[56,7,1,1,1] |
+----+---------------+
To-be:
+----+-----+-----+-----+-----+-----+-----+-----+
| id | 1 | 2 | 3 | 4 | 5 | 7 | 56 |
+----+-----+-----+-----+-----+-----+-----+-----+
|a | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
|b | 0 | 2 | 0 | 1 | 1 | 0 | 0 |
|c | 3 | 0 | 0 | 0 | 0 | 1 | 1 |
+----+-----+-----+-----+-----+-----+-----+-----+
How can I turn the first table into the second?
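One possible approach is to first explode the array and then pivot the exploded values. explode produces one row per array element, pivot turns the distinct element values into columns, count gives the per-id frequency of each value, and fillna(0) replaces the nulls left for values that never occur in a given row.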
# input data
data_sdf.show()
# +---+----------------+
# | id| data|
# +---+----------------+
# | a| [1, 2, 3, 4, 5]|
# | b| [2, 2, 4, 5]|
# | c|[56, 7, 1, 1, 1]|
# +---+----------------+
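from pyspark.sql import functions as func

# explode the array into one row per element, then pivot the element
# values into columns and count occurrences per id; absent values become 0
(
    data_sdf
    .withColumn('data_explode', func.explode('data'))
    .groupBy('id')
    .pivot('data_explode')
    .count()
    .fillna(0)
    .show()
)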
# +---+---+---+---+---+---+---+---+
# | id| 1| 2| 3| 4| 5| 7| 56|
# +---+---+---+---+---+---+---+---+
# | c| 3| 0| 0| 0| 0| 1| 1|
# | b| 0| 2| 0| 1| 1| 0| 0|
# | a| 1| 1| 1| 1| 1| 0| 0|
# +---+---+---+---+---+---+---+---+
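
To sanity-check the expected counts without a Spark session, the same explode-then-count logic can be sketched in plain Python. This is only an illustration on a small in-memory copy of the example data, not a replacement for the PySpark code; the names `rows`, `columns`, and `frequency_table` are made up for this sketch.

```python
from collections import Counter

# Hypothetical in-memory copy of the example rows (id -> list of values).
rows = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 2, 4, 5],
    "c": [56, 7, 1, 1, 1],
}

# Analogue of the pivot columns: every distinct value across all lists.
columns = sorted({value for values in rows.values() for value in values})


def frequency_table(rows, columns):
    """Count how often each column value occurs in each id's list."""
    table = {}
    for row_id, values in rows.items():
        counts = Counter(values)
        # Counter returns 0 for values it has not seen, mirroring fillna(0).
        table[row_id] = [counts[col] for col in columns]
    return table


table = frequency_table(rows, columns)
print(["id"] + columns)
for row_id, counts in table.items():
    print([row_id] + counts)
```

Each printed row should match the corresponding row of the pivoted output above. Row order may differ, since Spark does not guarantee an ordering after groupBy.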