Transpose each record into multiple rows in a PySpark DataFrame



I want to transpose each record into multiple rows in a PySpark DataFrame.

Here is my DataFrame:

+--------+-------------+--------------+------------+------+
|level_1 |level_2      |level_3       |level_4     |UNQ_ID|
+--------+-------------+--------------+------------+------+
|D  Group|Investments  |ORB           |ECM         |1     |
|E  Group|Investment   |Origination   |Execution   |2     |
+--------+-------------+--------------+------------+------+

The required DataFrame is:

+--------+---------------+------+
|level   |name           |UNQ_ID|
+--------+---------------+------+
|level_1 |D  Group       |1     |
|level_1 |E  Group       |2     |
|level_2 |Investments    |1     |
|level_2 |Investment     |2     |
|level_3 |ORB            |1     |
|level_3 |Origination    |2     |
|level_4 |ECM            |1     |
|level_4 |Execution      |2     |
+--------+---------------+------+

A simpler approach is to use the stack function:

# stack(4, ...) expands the four level_* columns of each row into four (level, name) rows
output_df = df.selectExpr('stack(4, "level_1", level_1, "level_2", level_2, "level_3", level_3, "level_4", level_4) as (level, name)', 'UNQ_ID')
output_df.show()
# +-------+-----------+------+
# |  level|       name|UNQ_ID|
# +-------+-----------+------+
# |level_1|   D  Group|     1|
# |level_2|Investments|     1|
# |level_3|        ORB|     1|
# |level_4|        ECM|     1|
# |level_1|   E  Group|     2|
# |level_2| Investment|     2|
# |level_3|Origination|     2|
# |level_4|  Execution|     2|
# +-------+-----------+------+
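
If the number of level columns can change, the stack expression can also be built from the column names instead of being hard-coded. A minimal sketch, assuming a local SparkSession and sample data recreated from the question:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreate the question's DataFrame
df = spark.createDataFrame(
    [("D  Group", "Investments", "ORB", "ECM", 1),
     ("E  Group", "Investment", "Origination", "Execution", 2)],
    ["level_1", "level_2", "level_3", "level_4", "UNQ_ID"],
)

# Build the stack() expression from whatever level_* columns exist,
# so the query does not need to change when levels are added or removed
level_cols = [c for c in df.columns if c.startswith("level_")]
pairs = ", ".join("'{0}', {0}".format(c) for c in level_cols)
stack_expr = "stack({}, {}) as (level, name)".format(len(level_cols), pairs)

output_df = df.selectExpr(stack_expr, "UNQ_ID")
output_df.show()

On Spark 3.4+ the same unpivot is also available directly as DataFrame.unpivot (aliased as melt), e.g. df.unpivot("UNQ_ID", level_cols, "level", "name").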
