Pyspark 'from_json',所有 json 列的数据帧返回空值



使用python(3.7.12版本)和pyspark(2.4.0版本)。

我试图使用from_json语句使用列和已识别的模式。但是,df返回null。我假设我错误地识别了列的模式和类型。

以下代码是我使用get_json_object从表中提取的json字符串:

df = df.select(col('id'), get_json_object(col("pulled_col"), "$.data"))
df.head()
#Row(id = '0123456', data = '[
#{"time" : [], "history" : [], "zip" : "78910", "phnumber" : #"5678910123", "name" : "-"},
#{"time" : [], "history" : [], "zip" : "78920", "phnumber" : #"5678910123", "name" : "-"},
#{"time" : [], "history" : [], "zip" : "78930", "phnumber" : #"5678910123", "name" : "-"},
#{"time" : [], "history" : [], "zip" : "78910", "phnumber" : #"5678910123", "name" : "-"}
#]')
df.printSchema()
#root
# |-- id: string (nullable = true)
# |-- data: string (nullable = true)
df.show()
#+-------+----------------------------+
#|     id|                        data|
#+-------+----------------------------+
#|0123456|[{"time" : [], "history"....|
#|0123456|[{"time" : [], "history"....|
#+-------+----------------------------+
test = df.select(col("id"), get_json_object(col("data"),"$.zip")
.alias("zip"))
.show(truncate=False)
# The output shouldn't be null?
#+-------+----+
#|     id| zip|
#+-------+----+
#|0123456|null|
#|0123456|null|
#+-------+----+
schema = StructType(
[
StructField('zip', StringType(), True),
StructField('phnumber', StringType(), True),
StructField('name', StringType(), True)
]
)
data_json = df.withColumn("data", from_json("data", schema))
.select(col('id'), col('data.*'))
# The df output shouldn't be null for the new json schema?
data_json.show()
#+-------+----+---------+-----+
#|     id| zip| phnumber| name|
#+-------+----+---------+-----+
#|0123456|null|     null| null|
#|0123456|null|     null| null|
#+-------+----+---------+-----+

data列实际上包含一个json数组所以模式必须是一个ArrayType:

schema = ArrayType(
elementType = StructType(
[
StructField('zip', StringType(), True),
StructField('phnumber', StringType(), True),
StructField('name', StringType(), True)
]
)
)
data_json = df.withColumn("data", F.from_json("data", schema))

的结果是:

root
|-- id: long (nullable = true)
|-- data: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- zip: string (nullable = true)
|    |    |-- phnumber: string (nullable = true)
|    |    |-- name: string (nullable = true)

现在,如果您希望数组中的每个元素在单独的行中,您可以将其展开并提取所需的字段:

data_json = df.withColumn("data", F.from_json("data", schema)) 
.withColumn("data", F.explode("data")) 
.select(F.col('id'), F.col('data.*'))

结果:

+---+-----+----------+----+
| id|  zip|  phnumber|name|
+---+-----+----------+----+
|  1|78910|5678910123|   -|
|  1|78920|5678910123|   -|
+---+-----+----------+----+

最新更新