AWS EMR Spark: download a CSV and use it with the Spark SQL API

// download the CSV file
ByteArrayOutputStream downloadedFile = downloadFile();
// save the CSV file in a temp folder
java.io.File tmpCsvFile = save(downloadedFile);
// read it with Spark SQL
Dataset<Row> ds = session
        .read()
        .option("header", "true")
        .csv(tmpCsvFile.getAbsolutePath());

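tmpCsvFile is saved at the following path:

The read fails with this exception:

org.apache.spark.sql.AnalysisException: Path does not exist:

I think the problem is that the file is saved locally, so when I try to read it through the Spark SQL API it cannot be found. I have already tried SparkContext.addFile(), and it did not work.

Is there a solution?

Thanks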

Spark supports a large number of file systems for reading and writing:

local/regular (file://)
S3 (s3://)
HDFS (hdfs://)

By default, if no URI scheme is specified, Spark SQL uses hdfs://driver_address:port/path.
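To make the default concrete, here is a minimal sketch in plain Java (no Spark or Hadoop on the classpath) of how a path without a scheme ends up on the default file system while a path with an explicit scheme is left alone. The class `PathResolution`, the method `qualify`, and the address `hdfs://namenode:8020/` are made up for illustration; on a real cluster the default comes from the Hadoop configuration, and Hadoop's own resolution logic is more involved than this.

```java
import java.net.URI;

public class PathResolution {

    // Hypothetical helper: qualifies a path against a default file system URI.
    // A path that already carries a scheme (file://, s3://, hdfs://) is returned unchanged.
    static URI qualify(String path, URI defaultFs) {
        URI uri = URI.create(path);
        if (uri.getScheme() != null) {
            return uri;
        }
        // No scheme: the path is interpreted on the default file system.
        return defaultFs.resolve(path);
    }

    public static void main(String[] args) {
        // Assumed default file system address, for illustration only.
        URI defaultFs = URI.create("hdfs://namenode:8020/");

        // A bare absolute path, like the one File.getAbsolutePath() returns.
        System.out.println(qualify("/tmp/data.csv", defaultFs));
        // An explicit scheme keeps the path on the local file system.
        System.out.println(qualify("file:///tmp/data.csv", defaultFs));
    }
}
```

This is why the absolute path of a temp file written on the driver is looked up on HDFS, where it does not exist.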

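Adding file:/// to the path only works in client mode; in my case (cluster mode) it does not. When the driver creates the task that reads the file, the task is handed to one of the nodes, and that node does not have the file.

What can we do? Write the file to Hadoop.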

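   // imports: java.io.*, java.net.URI, org.apache.hadoop.conf.Configuration,
   // org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.io.IOUtils
   Configuration conf = new Configuration();
   ByteArrayOutputStream downloadedFile = downloadFile();
   // convert the output stream into an input stream
   InputStream is = Functions.FROM_BAOS_TO_IS.apply(downloadedFile);
   String myfile = "miofile.csv";
   // dest is the destination URI of myfile on HDFS (its definition is not shown here)
   // acquire the file system
   FileSystem fs = FileSystem.get(URI.create(dest), conf);
   // open an output stream to Hadoop
   OutputStream outf = fs.create(new Path(dest));
   // write the file; the final 'true' closes both streams when the copy finishes
   IOUtils.copyBytes(is, outf, 4096, true);
   // read the file back from the location it was written to
   Dataset<Row> ds = session
    .read()
    .option("header", "true")
    .csv(dest);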
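The snippet relies on a helper, Functions.FROM_BAOS_TO_IS, whose code is not shown. A minimal sketch of what such a conversion could look like with only the JDK follows; the class `BaosToInputStream` and the method `toInputStream` are hypothetical stand-ins, not the actual helper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class BaosToInputStream {

    // Hypothetical stand-in for the helper used above.
    // toByteArray() copies the buffer, so the returned stream is not
    // affected by anything written to the output stream afterwards.
    static ByteArrayInputStream toInputStream(ByteArrayOutputStream baos) {
        return new ByteArrayInputStream(baos.toByteArray());
    }

    public static void main(String[] args) throws IOException {
        // Sample CSV content standing in for the downloaded file.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        baos.write("id,name\n1,alice\n".getBytes(StandardCharsets.UTF_8));

        ByteArrayInputStream in = toInputStream(baos);
        // readAllBytes() requires Java 9 or later.
        System.out.print(new String(in.readAllBytes(), StandardCharsets.UTF_8));
    }
}
```

The copy doubles the memory used for the file, which is fine for small CSVs but worth keeping in mind for large downloads.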

Thanks; any better solution is welcome.
