PySpark: handling null structs when saving JSON



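I am using PySpark with Spark 2.4 on Python 3. When writing a DataFrame to a JSON file, I want a null struct column to be written as {} and a null struct field to be written as "". For example: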

>>> df.printSchema()
root
|-- id: string (nullable = true)
|-- child1: struct (nullable = true)
|    |-- f_name: string (nullable = true)
|    |-- l_name: string (nullable = true)
|-- child2: struct (nullable = true)
|    |-- f_name: string (nullable = true)
|    |-- l_name: string (nullable = true)
>>> df.show()
+---+------------+------------+
| id|      child1|      child2|
+---+------------+------------+
|123|[John, Matt]|[Paul, Matt]|
|111|[Jack, null]|        null|
|101|        null|        null|
+---+------------+------------+
df.fillna("").coalesce(1).write.mode("overwrite").format('json').save('/home/test')

Result:


{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"Jack","l_name":""}}
{"id":"101"}

Desired output:


{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"Jack","l_name":""},"child2": {}}
{"id":"101","child1":{},"child2": {}}

I tried a few approaches with map and UDFs, but could not get the output I need. Thanks for your help.

Spark 3.x

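If you pass the ignoreNullFields option to the writer, you get the output below. Null structs and null fields are written as null rather than as {} or "", so it is not exactly an empty struct, but the schema is preserved for every record.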

df.fillna("").coalesce(1).write.mode("overwrite").format('json').option('ignoreNullFields', False).save('/home/test')
{"child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"},"id":"123"}
{"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}
{"child1":null,"child2":null,"id":"101"}
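
If the exact output from the question is required, one possible follow-up is to post-process the lines written with ignoreNullFields set to False. The sketch below is plain Python and not part of any Spark API: STRUCT_COLUMNS and normalize_line are hypothetical names introduced here, and it assumes the output is small enough to rewrite line by line outside Spark.

```python
import json

# Assumption: names of the struct columns, taken from the schema in the question.
STRUCT_COLUMNS = ("child1", "child2")


def normalize_line(line, struct_columns=STRUCT_COLUMNS):
    """Rewrite one JSON line: null struct -> {}, null struct field -> ""."""
    record = json.loads(line)
    for name in struct_columns:
        value = record.get(name)
        if value is None:
            # A missing or null struct column becomes an empty object.
            record[name] = {}
        else:
            # Null fields inside a present struct become empty strings.
            record[name] = {k: ("" if v is None else v) for k, v in value.items()}
    return json.dumps(record, separators=(",", ":"))


if __name__ == "__main__":
    sample = '{"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}'
    print(normalize_line(sample))
```

For the sample line this prints {"child1":{"f_name":"Jack","l_name":""},"child2":{},"id":"111"}. The same function also works on Spark 2.x output, where null columns are simply missing from the line.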

Spark 2.x

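Since the option above does not exist in Spark 2.x, I think there is a "dirty fix": mimic the JSON structure and bypass the null check by replacing each null struct with a struct whose fields are all null. Again, the result is not exactly what you asked for (null fields are still omitted, and every record is wrapped in a top-level json key), but the schema is correct.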

from pyspark.sql import functions as F

# Replace a null struct column with a struct whose fields are all null,
# so the writer emits {} instead of dropping the column.
(df
    .select(F.struct(
        F.col('id'),
        F.coalesce(F.col('child1'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child1'),
        F.coalesce(F.col('child2'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child2')
    ).alias('json'))
    .coalesce(1).write.mode("overwrite").format('json').save('/home/test')
)
{"json":{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}}
{"json":{"id":"111","child1":{"f_name":"Jack"},"child2":{}}}
{"json":{"id":"101","child1":{},"child2":{}}}
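
Why the placeholder struct comes out as {}: judging from the output above, the Spark 2.x JSON writer skips any field whose value is null, but still writes a struct that is itself non-null, even when all of its fields are null. The following is a plain-Python model of that rule, not Spark code; drop_null_fields is a hypothetical helper used only for illustration.

```python
import json


def drop_null_fields(value):
    """Model of the writer's behaviour: omit null fields, keep non-null structs."""
    if isinstance(value, dict):
        return {k: drop_null_fields(v) for k, v in value.items() if v is not None}
    return value


# Row 101 before the fix: both struct columns are null, so both are omitted.
before = {"id": "101", "child1": None, "child2": None}

# Row 101 after coalesce: the structs exist, only their fields are null.
after = {
    "id": "101",
    "child1": {"f_name": None, "l_name": None},
    "child2": {"f_name": None, "l_name": None},
}

print(json.dumps(drop_null_fields(before)))
print(json.dumps(drop_null_fields(after)))
```

The first line printed contains only the id key, while the second keeps child1 and child2 as empty objects, which matches the difference between the question's result and the output of the fix.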
