Generating ORC file format with Snappy compression

Suppose I have a TSV or CSV file. Is there a programmatic way in Java to convert the file to ORC format and apply Snappy compression to it?

Note: this is an outline, not complete code. Use it as a reference and adapt it into your solution.

Follow this short set of instructions and you can build your MapReduce code around it.
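
The outline does not show the map side, which has to turn each input line into a (key, int) pair before the shuffle. The following is a minimal plain-Java sketch of that parsing step only, assuming two-column TSV lines of the form key<TAB>number. The class name TsvLineParser and the rule of skipping malformed lines are assumptions made for illustration, not part of the original answer.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper: parses one TSV line the way a mapper would
// before emitting (Text, IntWritable). Uses only the JDK.
class TsvLineParser {

    /**
     * Splits one "key<TAB>number" line into a (key, int) pair.
     * Returns null when the line does not have exactly two columns
     * or when the second column is not an integer.
     */
    static Map.Entry<String, Integer> parse(String line) {
        // limit -1 keeps trailing empty columns so the count is exact
        String[] cols = line.split("\t", -1);
        if (cols.length != 2) {
            return null;
        }
        try {
            return new SimpleEntry<>(cols[0], Integer.parseInt(cols[1].trim()));
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("apple\t3"));  // apple=3
        System.out.println(parse("bad line"));  // null
    }
}
```

In a real mapper the pair would be written out as Text and IntWritable. For CSV the delimiter changes to a comma, but quoted fields need a proper CSV parser rather than split.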

Set the output format and compression codec in the driver class

In your driver class, set the output format class to ORC, like below [only an outline, not complete code]:

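Job job = Job.getInstance(conf);
job.setOutputFormatClass(OrcOutputFormat.class);
// setOutputCompressorClass takes a CompressionCodec, so pass SnappyCodec
// (SnappyCompressor is a Compressor, not a codec)
FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);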

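The reducer needs to create the Writable value to be written to the ORC file, and typically uses the OrcStruct.createValue(TypeDescription) function. For our example, let's assume the shuffle types are (Text, IntWritable) from the previous section, and the reduce should gather the integers for each key and write them as a list. The output schema would be struct<key:string,ints:array<int>>. As always with MapReduce, if your method stores the values, you need to copy each value before getting the next one.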

public static class MyReducer
  extends Reducer<Text,IntWritable,NullWritable,OrcStruct> {
  private TypeDescription schema =
    TypeDescription.fromString("struct<key:string,ints:array<int>>");
  // createValue creates the correct value type for the schema
  private OrcStruct pair = (OrcStruct) OrcStruct.createValue(schema);
  // get a handle to the list of ints
  private OrcList<IntWritable> valueList =
    (OrcList<IntWritable>) pair.getFieldValue(1);
  private final NullWritable nada = NullWritable.get();
  public void reduce(Text key, Iterable<IntWritable> values,
                     Context output
                     ) throws IOException, InterruptedException {
    pair.setFieldValue(0, key);
    valueList.clear();
    for(IntWritable val: values) {
      valueList.add(new IntWritable(val.get()));
    }
    output.write(nada, pair);
  }
}
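
The copy in the loop above (new IntWritable(val.get())) matters because Hadoop hands the reducer the same value object on every iteration and only changes its contents. Below is a minimal plain-Java sketch of that pitfall. IntBox and reusing(...) are hypothetical stand-ins for IntWritable and the reducer's Iterable, written here only so the effect can be seen without a cluster.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReuseDemo {

    /** Hypothetical stand-in for IntWritable: a mutable box around an int. */
    static final class IntBox {
        int value;
        IntBox(int value) { this.value = value; }
    }

    /** Iterable that returns one shared box, overwritten on each next(). */
    static Iterable<IntBox> reusing(int... data) {
        IntBox shared = new IntBox(0);
        return () -> new Iterator<IntBox>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < data.length; }
            @Override public IntBox next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.value = data[i++];
                return shared;
            }
        };
    }

    /** Buggy: keeps references, so every entry ends up as the last value. */
    static List<Integer> storeReferences(Iterable<IntBox> values) {
        List<IntBox> kept = new ArrayList<>();
        for (IntBox v : values) kept.add(v);
        List<Integer> out = new ArrayList<>();
        for (IntBox v : kept) out.add(v.value);
        return out;
    }

    /** Correct: copies each value before advancing the iterator. */
    static List<Integer> storeCopies(Iterable<IntBox> values) {
        List<IntBox> kept = new ArrayList<>();
        for (IntBox v : values) kept.add(new IntBox(v.value));
        List<Integer> out = new ArrayList<>();
        for (IntBox v : kept) out.add(v.value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(storeReferences(reusing(1, 2, 3))); // [3, 3, 3]
        System.out.println(storeCopies(reusing(1, 2, 3)));     // [1, 2, 3]
    }
}
```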

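This should write your data to HDFS in ORC format with the Snappy compression codec. If the files still come out with ORC's default compression, try setting the orc.compress property to SNAPPY in the job configuration, since the ORC writer appears to take its codec from that property.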
