I am able to write the DataFrame directly to ORC and PARQUET, and to TEXTFILE and AVRO using additional dependencies from Databricks:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.5.0</version>
</dependency>
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.10</artifactId>
<version>2.0.1</version>
</dependency>
Sample code:
SparkContext sc = new SparkContext(conf);
HiveContext hc = new HiveContext(sc);
DataFrame df = hc.table(hiveTableName);
df.printSchema();
DataFrameWriter writer = df.repartition(1).write();
if ("ORC".equalsIgnoreCase(hdfsFileFormat)) {
    writer.orc(outputHdfsFile);
} else if ("PARQUET".equalsIgnoreCase(hdfsFileFormat)) {
    writer.parquet(outputHdfsFile);
} else if ("TEXTFILE".equalsIgnoreCase(hdfsFileFormat)) {
    writer.format("com.databricks.spark.csv").option("header", "true").save(outputHdfsFile);
} else if ("AVRO".equalsIgnoreCase(hdfsFileFormat)) {
    writer.format("com.databricks.spark.avro").save(outputHdfsFile);
}
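One thing worth noting about the if/else chain above: an unrecognized format name falls through silently and nothing gets written. A minimal sketch of one way to make that explicit is below. The class name FormatResolver and the method resolveSource are hypothetical, made up for illustration; the only things taken from the post are the four format names and the two Databricks source names. It assumes the short names "orc" and "parquet" are accepted by DataFrameWriter.format in your Spark version.

```java
import java.util.Locale;

// Hypothetical helper (not part of Spark): maps the user-facing format name
// to a data source name that could be passed to DataFrameWriter.format(...).
final class FormatResolver {

    static String resolveSource(String hdfsFileFormat) {
        if (hdfsFileFormat == null) {
            throw new IllegalArgumentException("format must not be null");
        }
        // Case-insensitive match, mirroring equalsIgnoreCase in the original code.
        switch (hdfsFileFormat.toUpperCase(Locale.ROOT)) {
            case "ORC":
                return "orc";
            case "PARQUET":
                return "parquet";
            case "TEXTFILE":
                return "com.databricks.spark.csv";
            case "AVRO":
                return "com.databricks.spark.avro";
            default:
                // Fail loudly instead of silently writing nothing.
                throw new IllegalArgumentException("Unsupported format: " + hdfsFileFormat);
        }
    }

    public static void main(String[] args) {
        System.out.println(resolveSource("parquet"));
    }
}
```

With something like this, the write could presumably become a single call such as writer.format(FormatResolver.resolveSource(hdfsFileFormat)).save(outputHdfsFile), with the "header" option still set separately for the CSV case.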
Is there any way to write a DataFrame to a Hadoop SequenceFile or RCFile?
You can use void saveAsObjectFile(String path) to save an RDD as a SequenceFile of serialized objects. So in your case you first have to retrieve the RDD from the DataFrame:
JavaRDD<Row> rdd = df.javaRDD();
rdd.saveAsObjectFile(outputHdfsFile);