How to convert a Parquet file to Protobuf and save it to HDFS/AWS S3

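I have a file in Parquet format. I want to read it and save it to HDFS or AWS S3 in Protobuf format, using Spark with Scala. I'm not sure how to go about it. I've searched a lot of blogs but couldn't make sense of any of them. Can anyone help?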

You can use ProtoParquetReader, which is a ParquetReader with ProtoReadSupport.

Something like:

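import java.io.IOException;
import com.google.protobuf.MessageOrBuilder;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.proto.ProtoParquetReader;

// path is an org.apache.hadoop.fs.Path pointing at the Parquet file
try (ParquetReader<MessageOrBuilder> reader =
        ProtoParquetReader.<MessageOrBuilder>builder(path).build()) {
    MessageOrBuilder model;
    // read() returns null once the file is exhausted
    while ((model = reader.read()) != null) {
        System.out.println("check model -- " + model);
        ...
    }
} catch (IOException e) {
    e.printStackTrace();
}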

To read the Parquet file, you need code like the following:

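// Record is org.apache.avro.generic.GenericData.Record; Path is org.apache.hadoop.fs.Path.
public List<Record> read(Path path) throws IOException {
    List<Record> records = new ArrayList<>();
    // try-with-resources closes the reader even if read() throws
    try (ParquetReader<Record> reader = AvroParquetReader.<Record>builder(path)
            .withConf(new Configuration())
            .build()) {
        for (Record value = reader.read(); value != null; value = reader.read()) {
            records.add(value);
        }
    }
    return records;
}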

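This is how you write a file as Parquet. It is not a protobuf file, but it may help you get started. Keep in mind that if you end up using protobuf v2.6 or later with Spark Streaming, you will run into problems.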

// Record is org.apache.avro.generic.GenericData.Record; Mode is ParquetFileWriter.Mode.
// getSchema() and getConf() are helpers (not shown) returning the Avro Schema and the Hadoop Configuration.
public void write(List<Record> records, String location) throws IOException {
    Path filePath = new Path(location);
    // try-with-resources flushes and closes the writer; IOExceptions propagate to the caller
    try (ParquetWriter<Record> writer = AvroParquetWriter.<Record>builder(filePath)
            .withSchema(getSchema())
            .withConf(getConf())
            .withCompressionCodec(CompressionCodecName.SNAPPY)
            .withWriteMode(Mode.CREATE)
            .build()) {
        for (Record record : records) {
            writer.write(record);
        }
    }
}
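
For the protobuf output side, note that protobuf defines a message encoding but no file container, so storing many messages in one HDFS or S3 object needs some framing. One common choice is length-delimited framing: each record is preceded by its size as a base-128 varint, which is the layout protobuf-java's writeDelimitedTo and parseDelimitedFrom use. The sketch below shows that framing using only the JDK. The class name DelimitedFraming and its methods are hypothetical, and the byte arrays stand in for message.toByteArray(); with real generated classes you would probably call writeDelimitedTo on each message rather than reimplement this.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: length-delimited framing for serialized records.
final class DelimitedFraming {

    // Writes one record as varint(length) followed by the payload bytes.
    static void writeDelimited(OutputStream out, byte[] payload) throws IOException {
        int length = payload.length;
        // Base-128 varint: 7 bits per byte, high bit set while more bytes follow.
        while ((length & ~0x7F) != 0) {
            out.write((length & 0x7F) | 0x80);
            length >>>= 7;
        }
        out.write(length);
        out.write(payload);
    }

    // Reads the next record, or returns null at a clean end of stream.
    static byte[] readDelimited(InputStream in) throws IOException {
        int b = in.read();
        if (b == -1) {
            return null;
        }
        int length = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            if (shift > 28) {
                throw new IOException("length prefix too long");
            }
            b = in.read();
            if (b == -1) {
                throw new EOFException("truncated length prefix");
            }
            length |= (b & 0x7F) << shift;
        }
        if (length < 0) {
            throw new IOException("invalid record length");
        }
        byte[] payload = new byte[length];
        int offset = 0;
        // A single read() may return fewer bytes than requested, so loop until full.
        while (offset < length) {
            int n = in.read(payload, offset, length - offset);
            if (n == -1) {
                throw new EOFException("truncated record");
            }
            offset += n;
        }
        return payload;
    }
}
```

The OutputStream can be any stream, for example the one returned by Hadoop's FileSystem.create(path). Whether an s3a:// path works there depends on the hadoop-aws module and credentials being configured, which is assumed here rather than shown.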
