Spark - Java - Create Parquet/Avro Without Using Dataframes of Spark SQL

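I want to write the output of my Spark application as Parquet or Avro files. We only use core Spark, and the people working on the project do not want to change it to Spark SQL.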

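When I looked into these two file types, I could not find any example that does not use DataFrames or Spark SQL in general. Can I achieve this without using Spark SQL?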

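My data is tabular and has columns, but processing uses all of the data rather than a single column. The columns are decided at run time, so there are no fixed columns such as "name, ID, address". It looks like this: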

No f1       f2       f3       ...
1, 123.456, 123.457, 123.458, ...
2, 123.789, 123.790, 123.791, ...
...

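You cannot save an RDD as Parquet without converting it to a DataFrame. An RDD has no schema, but Parquet is a columnar format that requires one, so the RDD needs to be converted to a DataFrame.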

You can use the createDataFrame API for this.
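
Because the columns are only known at run time, whichever schema is used has to be built programmatically. The following is a minimal sketch using only the JDK, not code from the answers: it builds an Avro record schema as JSON text from a list of column names. The class name RuntimeSchemaJson, the record name "Row", and the typing rule (first column int, all others double, as in the sample data) are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not from the original answers): builds an Avro record
// schema, as JSON text, from column names that are only known at run time.
final class RuntimeSchemaJson {

    // Assumes the first column is an int row number and every other column is
    // a double, as in the sample data. Column names are assumed to be valid
    // Avro names (letters, digits and underscores, not starting with a digit).
    static String build(String recordName, List<String> columns) {
        StringBuilder json = new StringBuilder();
        json.append("{\"type\":\"record\",\"name\":\"").append(recordName)
            .append("\",\"fields\":[");
        for (int i = 0; i < columns.size(); i++) {
            if (i > 0) {
                json.append(',');
            }
            String type = (i == 0) ? "int" : "double";
            json.append("{\"name\":\"").append(columns.get(i))
                .append("\",\"type\":\"").append(type).append("\"}");
        }
        return json.append("]}").toString();
    }

    public static void main(String[] args) {
        // Prints the schema JSON for the columns No, f1, f2, f3.
        System.out.println(build("Row", Arrays.asList("No", "f1", "f2", "f3")));
    }
}
```

If Avro is on the classpath, the resulting text can be parsed into a Schema object with new Schema.Parser().parse(json).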

I tried this and it works like a champ...

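import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import org.apache.spark.api.java.JavaRDD;

public class ParquetHelper {
    // Static so the foreach lambda below can reach them without capturing `this`.
    // This only works when the driver and executors share one JVM (local mode).
    private static ParquetWriter<GenericData.Record> writer;
    private static Schema schema;

    public ParquetHelper(Schema schema, String pathName) {
        try {
            ParquetHelper.schema = schema;
            writer = AvroParquetWriter
                    .<GenericData.Record>builder(new Path(pathName))
                    .withRowGroupSize(ParquetWriter.DEFAULT_BLOCK_SIZE)
                    .withPageSize(ParquetWriter.DEFAULT_PAGE_SIZE)
                    .withSchema(schema)
                    .withConf(new Configuration())
                    .withCompressionCodec(CompressionCodecName.SNAPPY)
                    .withValidation(true)
                    .withDictionaryEncoding(false)
                    .build();
        } catch (IOException e) {
            // Fail fast instead of continuing with a null writer.
            throw new UncheckedIOException(e);
        }
    }

    /*
     * Writes every valid record of the RDD to the Parquet file, then closes the writer.
     */
    public static void writeToParquet(JavaRDD<GenericData.Record> empRDDRecords) throws IOException {
        empRDDRecords.foreach(record -> {
            // The original used a RecordValidator helper that was not shown;
            // GenericData.validate is substituted here as an assumption.
            if (record != null && GenericData.get().validate(schema, record)) {
                writer.write(record);
            } // TODO collect bad records here
        });
        writer.close();
    }
}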
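
The helper above expects records that already match the schema. As a rough sketch of the step before that, and not part of the original answer, the following JDK-only code pairs the values of one comma-separated row with the run-time column names. The class name RowParser and the typing rule (first value int, the rest double) are assumptions based on the sample data in the question.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not from the original answer): pairs the values of one
// comma-separated row with the column names decided at run time.
final class RowParser {

    // Assumes the first value is an int row number and the rest are doubles.
    // The returned map keeps the columns in their original order.
    static Map<String, Object> parse(List<String> columns, String line) {
        String[] parts = line.split(",");
        if (parts.length != columns.size()) {
            throw new IllegalArgumentException(
                    "Expected " + columns.size() + " values but got " + parts.length);
        }
        Map<String, Object> values = new LinkedHashMap<>();
        for (int i = 0; i < parts.length; i++) {
            String text = parts[i].trim();
            if (i == 0) {
                values.put(columns.get(i), Integer.valueOf(text));
            } else {
                values.put(columns.get(i), Double.valueOf(text));
            }
        }
        return values;
    }
}
```

Each entry could then be copied into a GenericData.Record created for the schema, using record.put(name, value), before the record is handed to the writer.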
