I tested with the JSON data at the following link:
http://developer.trade.gov/api/market-research-library.json
When I try to read the schema directly from it like this:
public void readJsonFormat() {
    Dataset<Row> people = spark.read().json("market-research-library.json");
    people.printSchema();
}
the read fails and the printed schema contains only a corrupt-record column:
root
|-- _corrupt_record: string (nullable = true)
If the file is not in the right format, how can I convert it to the format Spark expects?
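Either convert the JSON to a single line, or set option("multiLine", true) on the reader to allow multi-line JSON (the option is available from Spark 2.2 onward).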
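Both suggestions can be sketched in Python roughly as follows. This is only an illustration: the helper names `to_single_line` and `read_multiline_json` are made up here, and `spark` is assumed to be an existing SparkSession that the caller passes in.

```python
import json


def to_single_line(src_path, dst_path):
    """Rewrite a pretty-printed JSON document as one compact line."""
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        # json.dumps emits no line breaks when indent is not set
        dst.write(json.dumps(data) + "\n")


def read_multiline_json(spark, path):
    """Alternative: let Spark parse the multi-line file directly.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.read.option("multiLine", True).json(path)
```

The first helper needs no Spark at all; its output file can then be read with a plain `spark.read.json(...)` call.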
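If this is the only JSON you want to convert to a dataframe, I suggest using the wholeTextFiles API. Since the JSON is not in a Spark-readable format, it can only be converted to one when the whole file is read as a single value, which is what wholeTextFiles does.
You can then replace the line feeds and spaces in the JSON string, and read the result as a dataframe: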
sqlContext.read.json(sc.wholeTextFiles("path to market-research-library.json file").map(_._2.replace("\n", "").replace(" ", "")))
You should then have the required dataframe, with the following schema:
root
|-- basePath: string (nullable = true)
|-- definitions: struct (nullable = true)
| |-- Report: struct (nullable = true)
| | |-- properties: struct (nullable = true)
| | | |-- click_url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- country: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- description: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- expiration_date: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- id: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- report_type: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- source_industry: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- title: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | | |-- url: struct (nullable = true)
| | | | |-- description: string (nullable = true)
| | | | |-- type: string (nullable = true)
|-- host: string (nullable = true)
|-- info: struct (nullable = true)
| |-- description: string (nullable = true)
| |-- title: string (nullable = true)
| |-- version: string (nullable = true)
|-- paths: struct (nullable = true)
| |-- /market_research_library/search: struct (nullable = true)
| | |-- get: struct (nullable = true)
| | | |-- description: string (nullable = true)
| | | |-- parameters: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- format: string (nullable = true)
| | | | | |-- in: string (nullable = true)
| | | | | |-- name: string (nullable = true)
| | | | | |-- required: boolean (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | |-- responses: struct (nullable = true)
| | | | |-- 200: struct (nullable = true)
| | | | | |-- description: string (nullable = true)
| | | | | |-- schema: struct (nullable = true)
| | | | | | |-- items: struct (nullable = true)
| | | | | | | |-- $ref: string (nullable = true)
| | | | | | |-- type: string (nullable = true)
| | | |-- summary: string (nullable = true)
| | | |-- tags: array (nullable = true)
| | | | |-- element: string (containsNull = true)
|-- produces: array (nullable = true)
| |-- element: string (containsNull = true)
|-- schemes: array (nullable = true)
| |-- element: string (containsNull = true)
|-- swagger: string (nullable = true)
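One caveat with the replace-based cleanup above: removing every space also removes the spaces inside string values. A minimal Python sketch of the difference, using a made-up sample document (the title value is hypothetical, not taken from the real file), compares it with parsing and re-serializing the text:

```python
import json

# Hypothetical sample; the real file is much larger.
raw = '{\n  "info": {\n    "title": "Market Research Library"\n  }\n}'

# Stripping every space, like replace(" ", ""), also alters string values.
stripped = raw.replace("\n", "").replace(" ", "")

# Parsing and re-serializing drops only the whitespace between tokens.
compact = json.dumps(json.loads(raw), separators=(",", ":"))

print(stripped)
print(compact)
```

If the values matter and not just the schema, the parse-and-dump route is probably the safer way to get a single-line document.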
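By default Spark expects JSONL (JSON Lines), that is, one complete JSON value per line, which is not the same thing as a standard multi-line JSON document. Below is a small Python script that converts JSON to the expected format: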
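import json
import jsonlines

# Load the whole multi-line JSON document.
with open('C:/Users/ak/Documents/card.json', 'r') as f:
    json_data = json.load(f)

# Write one JSON value per line; this assumes json_data is a list of records.
with jsonlines.open('C:/Users/ak/Documents/card_lines.json', 'w') as writer:
    writer.write_all(json_data)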
You can then read the converted file in your program exactly as in the code you wrote.
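If installing the jsonlines package is not an option, the same conversion can be sketched with the standard library alone. The function name `write_json_lines` is made up for this example, and wrapping a top-level object in a one-element list is an assumption about how such a file should be handled, not something the script above does.

```python
import json


def write_json_lines(src_path, dst_path):
    """Convert a JSON file to JSON Lines and return the record count."""
    with open(src_path, "r", encoding="utf-8") as src:
        data = json.load(src)
    # A top-level array yields one record per element;
    # a top-level object is treated as a single record.
    records = data if isinstance(data, list) else [data]
    with open(dst_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")
    return len(records)
```

For a document like market-research-library.json, whose top level is a single object, this produces a one-line file.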