Apache Spark throws java.io.FileNotFoundException



I am trying to launch spark-submit from the machine running the Spark master, while my worker is on another machine:

  • Machine A: Spark master
  • Machine B: Spark worker (slave)

But it always throws an exception: java.io.FileNotFoundException

19/10/30 18:19:00 WARN TaskSetManager: Lost task 1.0 in stage 0.0 (TID 1, 130.229.129.193, executor 0): java.io.FileNotFoundException: File file:/private/var/folders/mf/hvtcpzmx6s39n7xc9182fxvc0000gp/T/spark-2237692a-95c1-4355-90fe-ed4524040879/userFiles-c1150981-0e3b-4b6f-a951-5783c1d14db8/data.csv does not exist

When I run the master and the worker on the same machine, no exception is thrown. I know the problem is related to the file location, but I am using the combination of:

  • sparkContext.addFile(fileName)
  • and SparkFiles.get(fileName)

So the added file should be downloaded on every node by this Spark job, shouldn't it?
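
This is how I understand the pair of calls (a minimal sketch; the comment describes the documented behaviour, not output I verified):

sparkContext.addFile("https://sda6.s3.eu-central-1.amazonaws.com/test/data.csv");
// SparkFiles.get returns the absolute path of the calling JVM's local copy,
// somewhere under a per-JVM temp directory like spark-<id>/userFiles-<id>/.
String localPath = SparkFiles.get("data.csv");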

I don't want to try another solution (S3, HDFS, …); I just want to figure out what I am doing wrong. Thanks.

The Java code I am using:

import java.io.IOException;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class Main {

    public static void main(String[] args) throws IOException {
        String fileName = "data.csv";
        // Count words and lines
        countWordsAndLines(fileName);
    }

    public static void countWordsAndLines(String fileName) throws IOException {
        // SparkConf object for describing the application configuration.
        SparkConf sparkConf = new SparkConf()
                .setAppName("Count file words and lines");
                //.setMaster("local[*]");  // Delete this line when submitting to a cluster

        // A SparkContext object is the main entry point for Spark.
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);

        // Add a file to be downloaded with this Spark job on every node.
        sparkContext.addFile("https://sda6.s3.eu-central-1.amazonaws.com/test/" + fileName);

        // Get the absolute path of a file added through SparkContext.addFile().
        // SparkContext is used to read a text file in memory as a JavaRDD object.
        JavaRDD<String> csvFile = sparkContext.textFile("file://" + SparkFiles.get(fileName));

        // Count no of lines
        System.out.println("Number of lines in file = " + csvFile.count());

        sparkContext.stop();
    }
}
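
Note that SparkFiles.get(fileName) above is evaluated on the driver, so the file:// path handed to textFile() is the driver's temp directory, and the path in the exception looks exactly like such a path. For comparison, here is a minimal sketch where each task resolves its own node-local copy instead (my assumption, untested; it needs java.nio.file.Files, java.nio.file.Paths, and java.util.Collections):

JavaRDD<String> csvFile = sparkContext
        .parallelize(Collections.singletonList(fileName))
        // SparkFiles.get now runs on the executor, resolving that node's copy.
        .flatMap(name -> Files.readAllLines(Paths.get(SparkFiles.get(name))).iterator());
System.out.println("Number of lines in file = " + csvFile.count());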

To run my application on the cluster, I am doing the following:

  1. Create a new file spark-env.sh in the apache-spark conf folder, in which I set SPARK_MASTER_HOST='machine-A-IP'. I do this on both machine A and machine B (a minimal example follows this list).

Then, on the command line:

  2. Start the master (machine A):
/usr/local/Cellar/apache-spark/2.4.4/libexec/sbin/start-master.sh
  3. Start the slave (machine B; I can see the worker show up on machine A's master web UI):
/usr/local/Cellar/apache-spark/2.4.4/libexec/sbin/start-slave.sh spark://machine-A-IP:7077
  4. Run spark-submit (machine A):
/usr/local/bin/spark-submit --master spark://machine-A-IP:7077 --class Main build/libs/spark-exercise-1.0-SNAPSHOT.jar
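
For reference, the spark-env.sh from step 1 is assumed to contain just this one line (SPARK_MASTER_HOST is a standard variable read by the start scripts; the IP is a placeholder):

export SPARK_MASTER_HOST='machine-A-IP'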

By the way, not sure whether this helps, but Spark will often throw a FileNotFoundException when what it actually has underneath is a hidden AccessDeniedException.
