Files are not saved to Azure Blob when using Spark on an HDInsight cluster

We have set up an HDInsight cluster on Azure with Blob storage as the storage layer for Hadoop. When we upload a file with the Hadoop CLI, it is uploaded to Azure Blob as expected.

Upload command:

hadoop fs -put somefile /testlocation

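However, when we write a file to Hadoop from Spark, it is not uploaded to Azure Blob storage. Instead, it ends up on the VM disk, in the directory configured for the datanode in hdfs-site.xml.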

Code:

df1mparquet = spark.read.parquet("hdfs://hostname:8020/dataSet/parquet/")
df1mparquet.write.parquet("hdfs://hostname:8020/dataSet/newlocation/")

Strange behavior:

When we run:

hadoop fs -ls / => It lists the files from Azure Blob storage
hadoop fs -ls hdfs://hostname:8020/ => It lists the files from local storage

Is this the expected behavior?

You need to check the value of fs.defaultFS in core-site.xml.

It sounds like the default filesystem is Blob storage.

https://hadoop.apache.org/docs/current/hadoop-azure/index.html
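
To make the check concrete, here is a minimal sketch that pulls fs.defaultFS out of core-site.xml content. The helper name read_property and the sample XML are illustrative assumptions, and the container and account names are placeholders; on a real cluster you would read the file from wherever your Hadoop configuration directory lives.

```python
import xml.etree.ElementTree as ET


def read_property(core_site_xml, name):
    """Return the value of a Hadoop property from core-site.xml text, or None."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        # Property names and values may carry surrounding whitespace in hand-edited files.
        if (prop.findtext("name") or "").strip() == name:
            value = prop.findtext("value")
            return value.strip() if value is not None else None
    return None


# Hypothetical core-site.xml content; "mycontainer" and "myaccount" are placeholders.
SAMPLE = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>wasb://mycontainer@myaccount.blob.core.windows.net</value>
  </property>
</configuration>"""

print(read_property(SAMPLE, "fs.defaultFS"))
```

If the value starts with wasb:// (or another Azure scheme) rather than hdfs://, then paths without a scheme go to Blob storage, which would match what hadoop fs -ls / shows.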

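As for Spark, if it loads the same Hadoop configuration as the CLI, you do not need to specify the namenode host and port. Just use the file path, and it will also default to Blob storage.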
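
As a rough sketch of that suggestion, the snippet below strips the scheme and namenode host/port from the URIs in the question so that Spark resolves them against fs.defaultFS. The helper names to_default_fs_path and copy_parquet are made up for illustration, and the sketch assumes Spark really does pick up the same Hadoop configuration as the CLI.

```python
from urllib.parse import urlparse


def to_default_fs_path(uri):
    """Drop the scheme and authority so the path resolves against fs.defaultFS."""
    return urlparse(uri).path


def copy_parquet(spark, src, dst):
    """Read Parquet from src and write it to dst, both on the default filesystem.

    `spark` is an existing SparkSession (for example the one a PySpark shell provides).
    """
    df = spark.read.parquet(to_default_fs_path(src))
    df.write.parquet(to_default_fs_path(dst))


print(to_default_fs_path("hdfs://hostname:8020/dataSet/parquet/"))
```

With a live session, copy_parquet(spark, "hdfs://hostname:8020/dataSet/parquet/", "hdfs://hostname:8020/dataSet/newlocation/") would read /dataSet/parquet/ and write /dataSet/newlocation/ on whatever filesystem fs.defaultFS names. Note that the input data must also exist on that filesystem.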

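If you specify a full URI for a different filesystem, that filesystem is used instead of the default. Note, though, that hdfs:// is still distinct from the actual local file:// filesystem: the data goes into HDFS, whose blocks the datanodes store on the VM disks.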
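
The two hadoop fs -ls results in the question follow from this rule. The sketch below is a simplified illustration of the resolution logic, not Hadoop's actual implementation: the function resolve is hypothetical, it only handles absolute paths, and the default filesystem URI uses placeholder container and account names.

```python
from urllib.parse import urlparse


def resolve(path, default_fs):
    """Return the fully qualified URI that a path would target.

    A path that already carries a scheme is used as-is; an absolute path
    without a scheme is attached to the default filesystem.
    """
    if urlparse(path).scheme:
        return path
    return default_fs.rstrip("/") + "/" + path.lstrip("/")


# Placeholder default filesystem; the real value comes from fs.defaultFS.
DEFAULT_FS = "wasb://mycontainer@myaccount.blob.core.windows.net"

print(resolve("/", DEFAULT_FS))
print(resolve("hdfs://hostname:8020/", DEFAULT_FS))
```

The first call lands on Blob storage and the second stays on the cluster's HDFS, which is why the two listings differ.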
