I am running Spark Scala code from the IntelliJ IDE on the Microsoft Windows platform.
I have four Spark DataFrames with roughly 30,000 records each, and as part of my requirement I am trying to extract one column from each of them.
I used Spark SQL functions to do this, and it executed successfully. When I call DF.show() or DF.count() I can see the results on screen, but when I try to write the DataFrames to my local disk (a Windows directory), the job is aborted with the following error:
Exception in thread "main" org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:147)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
    at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
    at main.src.countFeatures2$.countFeature$1(countFeatures.scala:118)
    at main.src.countFeatures2$.getFeatureAsString$1(countFeatures2.scala:32)
    at main.src.countFeatures2$.main(countFeatures2.scala:40)
    at main.src.countFeatures2.main(countFeatures2.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 2636, localhost, executor driver): java.io.IOException: (null) entry in command string: null chmod 0644 D:\Test_Output_File2\_temporary\0\_temporary\attempt_20170830194047_0031_m_000000_0\part-00000-85c32c55-e12d-4433-979d-ccecb2fcd341.csv
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:208)
    at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:178)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
    ... 28 more
Caused by: java.io.IOException: (null) entry in command string: null chmod 0644 D:\Test_Output_File2\_temporary\0\_temporary\attempt_20170830194047_0031_m_000000_0\part-00000-85c32c55-e12d-4433-979d-ccecb2fcd341.csv
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:733)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:225)
    at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.<init>(RawLocalFileSystem.java:209)
    at org.apache.hadoop.fs.RawLocalFileSystem.createOutputStreamWithMode(RawLocalFileSystem.java:307)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:296)
    at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:328)
    at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.<init>(ChecksumFileSystem.java:398)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:461)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:440)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
    at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat.getRecordWriter(TextOutputFormat.java:132)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVRelation.scala:208)
    at org.apache.spark.sql.execution.datasources.csv.CSVOutputWriterFactory.newInstance(CSVRelation.scala:178)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:234)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Picked up _JAVA_OPTIONS: -Xmx512M
Process finished with exit code 1
I can't figure out what went wrong here. Can anyone explain how to get past this?
UPDATE: Note that I was able to write the same files until yesterday, and nothing has changed in my system or the IDE's configuration. So I don't understand why it ran until yesterday and why it won't run now.
There is a similar post at this link: (null) entry in command string exception in saveAsTextFile() on Pyspark, but that one uses Pyspark in a Jupyter notebook, whereas my issue is with the IntelliJ IDE.
Super-simplified code that writes the output file to the local disk:
val Test_Output = spark.sql("select A.Col1, A.Col2, B.Col2, C.Col2, D.Col2 from A, B, C, D where A.primaryKey = B.primaryKey and B.primaryKey = C.primaryKey and C.primaryKey = D.primaryKey and D.primaryKey = A.primaryKey")

val Test_Output_File = Test_Output
  .coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "0")
  .save("D:/Test_Output_File")
This looks file-system related: java.io.IOException: (null) entry in command string: null chmod 0644
Since you are running on Windows, have you set HADOOP_HOME to a folder containing winutils.exe?
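If not, a minimal sketch of the setup looks like this; the C:\hadoop location is only an assumption for illustration, so point it at wherever your winutils.exe actually lives (it must sit in the bin subfolder). Without it, Hadoop's Shell cannot find the winutils binary it uses to run chmod on Windows, which is where the "(null) entry in command string: null chmod 0644" message comes from.

import org.apache.spark.sql.SparkSession

// Assumption: winutils.exe has been placed at C:\hadoop\bin\winutils.exe.
// Setting hadoop.home.dir from code has the same effect as exporting HADOOP_HOME.
System.setProperty("hadoop.home.dir", "C:\\hadoop")

val spark = SparkSession.builder()
  .appName("LocalWriteCheck") // arbitrary name, for illustration only
  .master("local[*]")
  .getOrCreate()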
Finally I corrected it myself: I used the .persist() method while creating the DataFrames, and that let me write the output file without any error, although I don't understand the logic behind it.
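For reference, the change amounted to roughly this (a sketch, not my exact code):

// Sketch: persist the joined DataFrame before the write is triggered
val Test_Output = spark.sql(
    "select A.Col1, A.Col2, B.Col2, C.Col2, D.Col2 from A, B, C, D " +
    "where A.primaryKey = B.primaryKey and B.primaryKey = C.primaryKey " +
    "and C.primaryKey = D.primaryKey and D.primaryKey = A.primaryKey")
  .persist() // marks the result for caching; with this in place the write succeeded

Test_Output.coalesce(1)
  .write
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("nullValue", "0")
  .save("D:/Test_Output_File")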
Thank you for your valuable input on this.