I am trying to read a nested JSON file. I cannot explode the nested columns and read the JSON file correctly.

My JSON file:
```
{
  "Univerity": "JNTU",
  "Department": {
    "DepartmentID": "101",
    "Student": {
      "lastName": "Fraun",
      "address": "23 hyd 500089",
      "email": "ss.fraun@yahoo.co.in",
      "Subjects": [
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        },
        {
          "subjectId": "12592",
          "subjectName": "Boyce"
        }
      ]
    }
  }
}
```
Code:
```
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark.sql import functions as F
from pyspark.sql.functions import col, regexp_replace, split

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    df.withColumn("Department", explode(col("Department")))
    df.show()
```
My output and the error are below:

+--------------------+---------+
|          Department|Univerity|
+--------------------+---------+
|{101, {[{12592, B...|     JNTU|
+--------------------+---------+
root
|-- Department: struct (nullable = true)
| |-- DepartmentID: string (nullable = true)
| |-- Student: struct (nullable = true)
| | |-- Subjects: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- subjectId: string (nullable = true)
| | | | |-- subjectName: string (nullable = true)
| | |-- address: string (nullable = true)
| | |-- email: string (nullable = true)
| | |-- lastName: string (nullable = true)
|-- Univerity: string (nullable = true)
Traceback (most recent call last):
File "C:/student/agility-data-electrode/electrode/entities/student.py", line 12, in <module>
df.withColumn("Department", explode(col("Department")))
  File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\dataframe.py", line 2455, in withColumn
    return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
  File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\py4j\java_gateway.py", line 1310, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Workspace\anaconda3\envs\student\pyspark\lib\site-packages\pyspark\sql\utils.py", line 117, in deco
    raise converted from None
raise converted from None
pyspark.sql.utils.AnalysisException: cannot resolve 'explode(`Department`)' due to data type mismatch: input to function explode should be array or map type, not struct<DepartmentID:string,Student:struct<Subjects:array<struct<subjectId:string,subjectName:string>>,address:string,email:string,lastName:string>>;
'Project [explode(Department#0) AS Department#65, Univerity#1]
+- Relation[Department#0,Univerity#1] json
You can only explode array (or map) columns, not structs. Struct fields are reached with dot notation, so select the nested `Subjects` array and explode that instead:
```
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.appName('Reminder').master('local').getOrCreate()

if __name__ == '__main__':
    df = spark.read.option("multiline", "true").json("C:\Workspace\student1.json").cache()
    df.show()
    df.printSchema()
    # explode works on the array column reached via dot notation; withColumn
    # returns a new DataFrame, so the result must be assigned back
    df = df.withColumn("Subjects", explode(col("Department.Student.Subjects")))
    df.show()