Convert JSON to the format Spark expects in order to create a DataFrame schema in Java



I am testing with the JSON data at the following link:

http://developer.trade.gov/api/market-research-library.json

When I try to read the schema directly from it like this:

public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}

it fails, and the printed schema contains only a corrupt-record column:

root
|-- _corrupt_record: string (nullable = true)

If the format is not correct, how can I convert it to the format Spark expects?

Convert the JSON to a single line.

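Alternatively, set option("multiLine", true) to allow multi-line JSON (the option is available from Spark 2.2), for example spark.read().option("multiLine", true).json("market-research-library.json").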
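To see why the default reader ends up with only _corrupt_record, here is a rough sketch in plain Python (not Spark itself) that mimics a reader treating every line as its own JSON record. The sample document and the helper name parse_per_line are made up for illustration; the real file is much larger.

```python
import json

# Made-up sample standing in for a pretty-printed file such as
# market-research-library.json (the values are invented).
PRETTY = '''{
  "swagger": "2.0",
  "info": {
    "title": "Example API"
  }
}'''


def parse_per_line(text):
    """Parse each non-empty line as a separate JSON record.

    Returns (parsed_records, rejected_lines), roughly what a
    line-oriented JSON reader would accept and reject.
    """
    good, bad = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad.append(line)
    return good, bad


# Pretty-printed input: no single line is a complete JSON document.
good, bad = parse_per_line(PRETTY)
print(len(good), len(bad))

# The same document re-serialized onto one line parses as one record.
single_line = json.dumps(json.loads(PRETTY))
good_one, bad_one = parse_per_line(single_line)
print(len(good_one), len(bad_one))
```

Both suggestions above address the same mismatch: either the file is reshaped so that each record fits on one line, or the reader is told to parse the file as a whole.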

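If this is the only JSON file you want to convert to a dataframe, I suggest using the wholeTextFiles API. The file is pretty-printed across many lines, which Spark cannot read directly, so it can be turned into a Spark-readable format only when the whole file is read as a single value. That is what wholeTextFiles does: it returns each file as a (path, content) pair.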

Then you can replace the newline characters and spaces in the JSON string. Finally, build the dataframe from the result.

sqlContext.read.json(sc.wholeTextFiles("path to market-research-library.json file").map(_._2.replace("\n", "").replace(" ", "")))
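One caveat with a blanket replace(" ", ""): it also removes spaces inside string values (and inside keys, if any contain spaces). The inferred schema is normally unaffected, but the data is. A minimal Python sketch of the difference, using an invented sample document, compares plain stripping with parsing and re-serializing:

```python
import json

# Hypothetical sample; the description text is invented for illustration.
raw = '{\n  "info": {\n    "description": "Market research reports"\n  }\n}'

# Approach from the snippet above: drop every newline and every space.
stripped = raw.replace("\n", "").replace(" ", "")

# Alternative: parse, then re-serialize compactly. Only whitespace
# between tokens disappears; string contents are left untouched.
compact = json.dumps(json.loads(raw), separators=(",", ":"))

print(json.loads(stripped)["info"]["description"])  # Marketresearchreports
print(json.loads(compact)["info"]["description"])   # Market research reports
```

Both results are single-line JSON, so either would be readable as one record; the second simply keeps the values intact.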

You should then have the required dataframe with the following schema:

root
|-- basePath: string (nullable = true)
|-- definitions: struct (nullable = true)
|    |-- Report: struct (nullable = true)
|    |    |-- properties: struct (nullable = true)
|    |    |    |-- click_url: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- country: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- description: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- expiration_date: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- id: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- industry: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- report_type: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- source_industry: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- title: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- url: struct (nullable = true)
|    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |-- type: string (nullable = true)
|-- host: string (nullable = true)
|-- info: struct (nullable = true)
|    |-- description: string (nullable = true)
|    |-- title: string (nullable = true)
|    |-- version: string (nullable = true)
|-- paths: struct (nullable = true)
|    |-- /market_research_library/search: struct (nullable = true)
|    |    |-- get: struct (nullable = true)
|    |    |    |-- description: string (nullable = true)
|    |    |    |-- parameters: array (nullable = true)
|    |    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |    |-- format: string (nullable = true)
|    |    |    |    |    |-- in: string (nullable = true)
|    |    |    |    |    |-- name: string (nullable = true)
|    |    |    |    |    |-- required: boolean (nullable = true)
|    |    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- responses: struct (nullable = true)
|    |    |    |    |-- 200: struct (nullable = true)
|    |    |    |    |    |-- description: string (nullable = true)
|    |    |    |    |    |-- schema: struct (nullable = true)
|    |    |    |    |    |    |-- items: struct (nullable = true)
|    |    |    |    |    |    |    |-- $ref: string (nullable = true)
|    |    |    |    |    |    |-- type: string (nullable = true)
|    |    |    |-- summary: string (nullable = true)
|    |    |    |-- tags: array (nullable = true)
|    |    |    |    |-- element: string (containsNull = true)
|-- produces: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- schemes: array (nullable = true)
|    |-- element: string (containsNull = true)
|-- swagger: string (nullable = true)

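By default, Spark expects JSONL (JSON Lines): one complete JSON value per line, which is not the same as a standard pretty-printed JSON document. Below is a small Python script that converts JSON to the expected format: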

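import json

import jsonlines  # third-party package: pip install jsonlines

# Load the whole standard-JSON document (assumed to be a top-level array).
with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)

# Write each array element on its own line (JSONL).
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    writer.write_all(json_data)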

You can then read the converted file in your program exactly as in the code you wrote.
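If installing an extra package is not an option, the same conversion can be sketched with the standard library alone. The function names to_json_lines and convert_file are hypothetical. This version also handles a file whose top level is a single object rather than an array, which the Swagger-style schema printed earlier suggests is the case for market-research-library.json; iterating over a Python dict would otherwise yield its keys, not one record.

```python
import json


def to_json_lines(document):
    """Return JSONL text: one compact JSON value per line.

    A top-level array yields one line per element; any other value
    (for example a single object) yields exactly one line.
    """
    records = document if isinstance(document, list) else [document]
    # json.dumps escapes newlines inside strings, so each record
    # is guaranteed to stay on a single physical line.
    return "".join(json.dumps(record) + "\n" for record in records)


def convert_file(src_path, dst_path):
    """Read standard JSON from src_path and write JSONL to dst_path."""
    with open(src_path, "r", encoding="utf-8") as src:
        document = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(to_json_lines(document))
```

For example, convert_file("market-research-library.json", "market-research-library.jsonl") would produce a one-line file, and that output path could then be passed to spark.read().json(...) as in the question.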
