Reading a nested JSON file in PySpark: pyspark.sql.utils.AnalysisException



I am trying to read a nested JSON file. I cannot explode the nested columns and read the JSON file correctly.

My JSON file:
```
{
  "Univerity": "JNTU",
  "Department": {
    "DepartmentID": "101",
    "Student": {
      "lastName": "Fraun",
      "address": "23 hyd 500089",
      "email": "ss.fraun@yahoo.co.in",
      "Subjects": [
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        },
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        }
      ]
    }
  }
}
```
Code:
```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    df.withColumn("Department", explode(col("Department")))
    df.show()
```

My output and error below:

+--------------------+---------+
|          Department|Univerity|
+--------------------+---------+
|{101, {[{12592, B...|     JNTU|
+--------------------+---------+

root
|-- Department: struct (nullable = true)
|    |-- DepartmentID: string (nullable = true)
|    |-- Student: struct (nullable = true)
|    |    |-- Subjects: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- subjectId: string (nullable = true)
|    |    |    |    |-- subjectName: string (nullable = true)
|    |    |-- address: string (nullable = true)
|    |    |-- email: string (nullable = true)
|    |    |-- lastName: string (nullable = true)
|-- Univerity: string (nullable = true)
Traceback (most recent call last):
File "C:/student/agility-data-electrode/electrode/entities/student.py", line 12, in <module>
df.withColumn("Department", explode(col("Department")))
File "C:\Workspace\anaconda3\envs\studentpyspark\lib\site-packages\pyspark\sql\dataframe.py", line 2455, in withColumn
return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
File "C:\Workspace\anaconda3\envs\studentpyspark\lib\site-packages\py4j\java_gateway.py", line 1310, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "C:\Workspace\anaconda3\envs\studentpyspark\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve 'explode(`Department`)' due to data type mismatch: input to function explode should be array or map type, not struct<DepartmentID:string,Student:struct<Subjects:array<struct<subjectId:string,subjectName:string>>,address:string,email:string,lastName:string>>;
'Project [explode(Department#0) AS Department#65, Univerity#1]
+- Relation[Department#0,Univerity#1] json

You can only explode array columns. `Department` is a struct, so select the `Subjects` array column to explode instead:

```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    # withColumn returns a new DataFrame; assign it back so show() reflects the change
    df = df.withColumn("Subjects", explode(col("Department.Student.Subjects")))
    df.show()
