How do I create a DataFrame schema from a JSON schema file in PySpark?



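I'm trying to create a DataFrame schema from a JSON schema file in PySpark. Once the schema is created, I'll use it to load a JSON data file. Can anyone help? Thanks in advance. My schema JSON file looks like this: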

[
  {
    "name": "visitorId",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "visitStartTime",
    "type": "INTEGER",
    "mode": "NULLABLE"
  },
  {
    "name": "totals",
    "type": "RECORD",
    "mode": "NULLABLE",
    "fields": [
      {
        "name": "visits",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "hits",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "pageviews",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "transactions",
        "type": "INTEGER",
        "mode": "NULLABLE"
      },
      {
        "name": "timeOnScreen",
        "type": "INTEGER",
        "mode": "NULLABLE"
      }
    ]
  },
  {
    "name": "channelGrouping",
    "type": "STRING",
    "mode": "NULLABLE"
  }
]

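Your schema isn't in the format PySpark expects, so it can't be parsed. StructType.fromJson reads Spark's own schema JSON: a top-level object with "type": "struct" and a "fields" list, with lowercase type names, "nullable" in place of "mode", and a nested struct object in place of RECORD.

I've changed your schema to: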

{
  "type": "struct",
  "fields": [
    {
      "name": "visitorId",
      "type": "integer",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "visitStartTime",
      "type": "integer",
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "totals",
      "type": {
        "type": "struct",
        "fields": [
          {
            "name": "visits",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "hits",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "pageviews",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "transactions",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          },
          {
            "name": "timeOnScreen",
            "type": "integer",
            "nullable": true,
            "metadata": {}
          }
        ]
      },
      "nullable": true,
      "metadata": {}
    },
    {
      "name": "channelGrouping",
      "type": "string",
      "nullable": true,
      "metadata": {}
    }
  ]
}
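
If you have many schema files like this, the hand conversion above could also be scripted. Below is a minimal sketch, not a definitive implementation. It assumes the name/type/mode layout from the question (which resembles a BigQuery table schema export). The helper names convert_field and convert_schema and the TYPE_MAP table are my own, not part of PySpark, and the mapping only covers a few types. INTEGER is mapped to "integer" to match the schema above; "long" may be the safer choice if values can exceed 32 bits.

```python
import json

# Assumed mapping from the source type names to Spark's JSON type names.
# Only a few types are covered; extend as needed.
TYPE_MAP = {
    "STRING": "string",
    "INTEGER": "integer",  # mirrors the hand-converted schema; consider "long"
    "FLOAT": "double",
    "BOOLEAN": "boolean",
}


def convert_field(field):
    """Convert one {"name", "type", "mode"[, "fields"]} entry to a Spark field dict."""
    mode = field.get("mode", "NULLABLE")
    if field["type"] == "RECORD":
        # Nested records become nested structs
        spark_type = convert_schema(field["fields"])
    else:
        spark_type = TYPE_MAP[field["type"]]  # raises KeyError for an unmapped type
    if mode == "REPEATED":
        # Assumption: repeated fields are represented as arrays
        spark_type = {"type": "array", "elementType": spark_type, "containsNull": True}
    return {
        "name": field["name"],
        "type": spark_type,
        "nullable": mode != "REQUIRED",
        "metadata": {},
    }


def convert_schema(fields):
    """Convert a list of field entries to a Spark struct-type dict."""
    return {"type": "struct", "fields": [convert_field(f) for f in fields]}


if __name__ == "__main__":
    # Shortened version of the schema from the question
    sample = [
        {"name": "visitorId", "type": "INTEGER", "mode": "NULLABLE"},
        {
            "name": "totals",
            "type": "RECORD",
            "mode": "NULLABLE",
            "fields": [{"name": "visits", "type": "INTEGER", "mode": "NULLABLE"}],
        },
    ]
    print(json.dumps(convert_schema(sample), indent=2))
```

The resulting dict should be usable directly with StructType.fromJson, or you can write it out with json.dump and save it as schema.json.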

Save it as schema.json, then create a StructType from that JSON with:

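import json

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType

if __name__ == "__main__":
    spark = SparkSession \
        .builder \
        .appName("Test") \
        .getOrCreate()

    # Read the Spark-format schema saved above
    with open("schema.json") as f_in:
        schema_data = json.load(f_in)

    # Build a StructType from the parsed JSON
    schemaFromJson = StructType.fromJson(schema_data)

    # Create an empty DataFrame just to check the schema
    df = spark.createDataFrame([], schemaFromJson)
    df.printSchema()

    # To load your data file with this schema ("data.json" is a placeholder path):
    # df = spark.read.schema(schemaFromJson).json("data.json")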

The output is:

root
|-- visitorId: integer (nullable = true)
|-- visitStartTime: integer (nullable = true)
|-- totals: struct (nullable = true)
|    |-- visits: integer (nullable = true)
|    |-- hits: integer (nullable = true)
|    |-- pageviews: integer (nullable = true)
|    |-- transactions: integer (nullable = true)
|    |-- timeOnScreen: integer (nullable = true)
|-- channelGrouping: string (nullable = true)