Get all HDFS directories but not any files



Is there a way to get all the HDFS directories but not the files? For example, if my HDFS structure looks like this:

/user/classA/part-r-0000
/user/classA/part-r-0001
/user/classA/part-r-0002
/user/classA/_counter/val1
/user/classA/_counter/val2
/user/classA/_counter/val3
/user/classA/_counter/val4
/user/classB/part-r-0000
/user/classB/part-r-0001
/user/classB/_counter/val1
/user/classB/_counter/status/test_file1

the result should be:

/user/classA/
/user/classA/_counter
/user/classB
/user/classB/_counter
/user/classB/_counter/status/
With the HDFS shell you can list everything recursively and keep only the directory entries (the lines whose permission string starts with `d`):

    hdfs dfs -ls -R /user | grep "^d"

Since you want a Spark (the apache-spark tag was added) / Hadoop solution, I assume you are looking for more than just an hdfs command.

  • The logic is to use Spark to list all the file statuses of the Hadoop FileSystem...

isDirectory checks whether each entry is a directory, and that is what the filter keeps.

    package examples

    import org.apache.log4j.{Level, Logger}
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.SparkSession

    object ListHDFSDirectories extends App {
      // Keep Spark's own logging quiet so only the directory paths are printed.
      Logger.getLogger("org").setLevel(Level.WARN)

      val spark = SparkSession.builder()
        .appName(this.getClass.getName)
        .config("spark.master", "local[*]")
        .getOrCreate()

      val hdfspath = "." // your path here

      // Reuse Spark's Hadoop configuration to get a FileSystem handle,
      // then keep only the entries that are directories.
      val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
      fs.listStatus(new Path(hdfspath))
        .filter(_.isDirectory)
        .map(_.getPath)
        .foreach(println)
    }

Result:

file:/Users/user/codebase/myproject/target
file:/Users/user/codebase/myproject/Rel
file:/Users/user/codebase/myproject/spark-warehouse
file:/Users/user/codebase/myproject/metastore_db
file:/Users/user/codebase/myproject/.idea
file:/Users/user/codebase/myproject/src
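
Note that the expected output in the question also includes nested directories such as /user/classB/_counter/status, while listStatus above only looks one level deep. Below is a minimal sketch that recurses into sub-directories; it uses the plain Hadoop FileSystem API (it works the same with spark.sparkContext.hadoopConfiguration), and the listDirs helper and the /user start path are just illustrative assumptions.

    package examples

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListHDFSDirectoriesRecursive extends App {
      val fs = FileSystem.get(new Configuration())

      // Collect this level's directories, then recurse into each of them.
      def listDirs(path: Path): Seq[Path] = {
        val dirs = fs.listStatus(path).filter(_.isDirectory).map(_.getPath).toSeq
        dirs ++ dirs.flatMap(listDirs)
      }

      listDirs(new Path("/user")).foreach(println) // "/user" is just an example root
    }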
