将带有嵌套结构的数组与Pyspark DataFrame的其他列一起转换为字符串列



这类似于pyspark:带有嵌套结构到字符串

但是,接受的答案对我的案件不起作用,所以在这里询问

|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
    |-- element: struct (containsNull = true)
          |-- Col2Sub: string (nullable = true)

样本JSON

{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}

这给出了单列的结果

import pyspark.sql.functions as F
df.selectExpr("EXPLODE(Col2) AS structCol").select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()
    +----------------+
    | Col2_concated  |
    +----------------+
    |foo,bar         |
    +----------------+

但是,如何获得结果或数据框架

+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo,bar        |
+-------+---------------+

编辑:该解决方案给出了错误的结果

df.selectExpr("Col1","EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show() 

+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo            |
+-------+---------------+
|abc123 |bar            |
+-------+---------------+

只需避免爆炸,就已经在那里。您需要的只是Concat_ws函数。此函数将多个字符串列与给定的分离器连接在一起。请参阅下面的示例:

from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))
#printSchema tells us the column names we can use with concat_ws                                                                              
df.printSchema()

输出:

root
 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)

列Col2是Col2Sub的一个数组,我们可以使用此列名来获得所需的结果:

bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))
bla.show()
+------+-------+                                                                
|  Col1|   Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+

相关内容

  • 没有找到相关文章

最新更新