Spark: explode a JSON array column into rows



How do I define the schema for a JSON array so that I can explode it into rows?

I have a UDF that returns a string (a JSON array). I want to explode the items of that array into rows and then save the result.

Example JSON:

[{"name":"name1","id":"1"}, {"name":"name2","id":"2"}]

My schema is defined as:

// schema of each item in the array
var itemType = new StructType(new[]
{
    new StructField("name", new StringType()),
    new StructField("id", new StringType())
});
var rootType = new ArrayType(itemType); // array of items
var field = new StructField(name: "MyJson", dataType: rootType, isNullable: false);
var schema = new StructType(new[] { field });

df.select(from_json('mycol', schema))

After exploding, it creates a schema like this:

root
|-- col: struct (nullable = true)
|    |-- name: string
|    |-- id: string

I am running on a local cluster and writing to CSV. I expected that after the explode, the DataFrame would have two columns, name and id, so that I could write all the rows to CSV. When I run it, the DataFrame schema is not created as name, id, and the CSV write fails with the message "CSV does not support struct<...>".

Sample DataFrame:

%python
list=[['[{"name":"name1","id":"1"}, {"name":"name2","id":"2"}]']]
df=spark.createDataFrame(list,['mycol'])
df.show(truncate=False)
#df:pyspark.sql.dataframe.DataFrame = [mycol: string]
#+------------------------------------------------------+
#|mycol                                                 |
#+------------------------------------------------------+
#|[{"name":"name1","id":"1"}, {"name":"name2","id":"2"}]|
#+------------------------------------------------------+

You can select all fields of the struct with `column_name.*`:

from pyspark.sql.types import *
from pyspark.sql import functions as F

json_schema = ArrayType(StructType([StructField("name", StringType()), StructField("id", StringType())]))
df.withColumn("json", F.explode(F.from_json("mycol", json_schema))) \
  .select("json.*").show()
#+-----+---+
#| name| id|
#+-----+---+
#|name1|  1|
#|name2|  2|
#+-----+---+
