I want to convert each record in a PySpark dataframe into multiple rows.
This is my dataframe:
+--------+-------------+--------------+------------+------+
|level_1 |level_2      |level_3       |level_4     |UNQ_ID|
+--------+-------------+--------------+------------+------+
|D Group |Investments  |ORB           |ECM         |1     |
|E Group |Investment   |Origination   |Execution   |2     |
+--------+-------------+--------------+------------+------+
The required dataframe is:
+--------+---------------+------+
|level   |name           |UNQ_ID|
+--------+---------------+------+
|level_1 |D Group        |1     |
|level_1 |E Group        |2     |
|level_2 |Investments    |1     |
|level_2 |Investment     |2     |
|level_3 |ORB            |1     |
|level_3 |Origination    |2     |
|level_4 |ECM            |1     |
|level_4 |Execution      |2     |
+--------+---------------+------+
A simpler way is to use the stack function:
# stack(4, label1, col1, ..., label4, col4) turns the four level columns into (level, name) rows
output_df = df.selectExpr('stack(4, "level_1", level_1, "level_2", level_2, "level_3", level_3, "level_4", level_4) as (level, name)', 'UNQ_ID')
output_df.show()
# +-------+-----------+------+
# |  level|       name|UNQ_ID|
# +-------+-----------+------+
# |level_1|    D Group|     1|
# |level_2|Investments|     1|
# |level_3|        ORB|     1|
# |level_4|        ECM|     1|
# |level_1|    E Group|     2|
# |level_2| Investment|     2|
# |level_3|Origination|     2|
# |level_4|  Execution|     2|
# +-------+-----------+------+
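
One caveat: stack emits the rows grouped by UNQ_ID, not grouped by level as in the requested dataframe; an orderBy restores that ordering. On Spark 3.4+ the same wide-to-long reshape is also available as DataFrame.unpivot; the code below is a sketch against the same df:

# Sort to match the requested row order (by level, then UNQ_ID)
output_df.orderBy('level', 'UNQ_ID').show()

# Spark 3.4+ alternative: DataFrame.unpivot does the same reshape without an SQL expression
unpivoted = (
    df.unpivot(
        ids=['UNQ_ID'],                                       # identifier column repeated on every row
        values=['level_1', 'level_2', 'level_3', 'level_4'],  # columns melted into rows
        variableColumnName='level',                           # receives the source column name
        valueColumnName='name',                               # receives the source column value
    )
    .select('level', 'name', 'UNQ_ID')
    .orderBy('level', 'UNQ_ID')
)
unpivoted.show()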