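I am able to use avro-tools-1.7.7.jar to take JSON data and an Avro schema and output a binary Avro file, as shown at https://github.com/miguno/avro-cli-examples#json-to-avro. However, I want to be able to do this programmatically using the Avro Python API: https://avro.apache.org/docs/1.7.7/gettingstartedpython.html
In their example, they show how to write records one at a time into a binary Avro file.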
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter
schema = avro.schema.parse(open("user.avsc").read())
writer = DataFileWriter(open("users.avro", "wb"), DatumWriter(), schema)
writer.append({"name": "Alyssa", "favorite_number": 256})
writer.append({"name": "Ben", "favorite_number": 7, "favorite_color": "red"})
writer.close()
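My use case is writing all of the records at once, like the avro-tools jar does from a JSON file, just in Python code. I don't want to shell out and execute the jar. This will be deployed to Google App Engine, if that matters.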
This can be accomplished with fastavro. For example, given the schema in the link:
twitter.avsc
{
  "type" : "record",
  "name" : "twitter_schema",
  "namespace" : "com.miguno.avro",
  "fields" : [ {
    "name" : "username",
    "type" : "string",
    "doc" : "Name of the user account on Twitter.com"
  }, {
    "name" : "tweet",
    "type" : "string",
    "doc" : "The content of the user's Twitter message"
  }, {
    "name" : "timestamp",
    "type" : "long",
    "doc" : "Unix epoch time in seconds"
  } ],
  "doc" : "A basic schema for storing Twitter messages"
}
And the JSON file:
twitter.json
{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }
{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }
You can use something like the following script to write out an Avro file:
import json
from fastavro import json_reader, parse_schema, writer

# Load and parse the Avro schema.
with open("twitter.avsc") as fp:
    schema = parse_schema(json.load(fp))

# Read the JSON records and write them all out as one binary Avro file.
with open("twitter.avro", "wb") as avro_file:
    with open("twitter.json") as fp:
        writer(avro_file, schema, json_reader(fp, schema))
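If you would rather load the records yourself, for example to catch a malformed line before anything is written, plain newline-delimited JSON like twitter.json can be parsed with the standard library alone. The following is a minimal sketch assuming Python 3. The helper name `load_records` and the `PY_TYPES` mapping are hypothetical and only cover the two primitive types in this schema, so this is a light sanity check and not real Avro validation.

```python
import io
import json

# Field names and types copied from twitter.avsc above (docs omitted).
SCHEMA = {
    "type": "record",
    "name": "twitter_schema",
    "namespace": "com.miguno.avro",
    "fields": [
        {"name": "username", "type": "string"},
        {"name": "tweet", "type": "string"},
        {"name": "timestamp", "type": "long"},
    ],
}

# Hypothetical, deliberately incomplete mapping from Avro primitive
# types to the Python types json.loads produces for them.
PY_TYPES = {"string": str, "long": int}


def load_records(lines, schema):
    """Parse newline-delimited JSON and sanity-check each record.

    Hypothetical helper: it only checks that every schema field is
    present and has the expected Python type.
    """
    records = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for field in schema["fields"]:
            name, avro_type = field["name"], field["type"]
            if name not in record:
                raise ValueError("line %d: missing field %r" % (lineno, name))
            if not isinstance(record[name], PY_TYPES[avro_type]):
                raise ValueError(
                    "line %d: field %r is not a %s" % (lineno, name, avro_type)
                )
        records.append(record)
    return records


# Stand-in for open("twitter.json"); any iterable of lines works.
sample = io.StringIO(
    '{"username":"miguno","tweet":"Rock: Nerf paper, scissors is fine.","timestamp": 1366150681 }\n'
    '{"username":"BlizzardCS","tweet":"Works as intended. Terran is IMBA.","timestamp": 1366154481 }\n'
)
records = load_records(sample, SCHEMA)
```

The resulting list of dicts can be passed as the third argument to fastavro's `writer` in place of `json_reader(fp, schema)`, since `writer` accepts any iterable of records.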