Reading files from HDFS with Spark (Scala)

Please tell me how to read files from HDFS. I have just started using Scala and Spark. I can read a single file located in a folder:

val parqDF = spark.read.parquet("hdfs://nn1home:8020/user/stg/ads/year=2020/month=1/day=1/16_data.0.parq")

But I want to read the whole folder and all the parquet files in it.

One more important question: how do I add a column to my DataFrame with data taken from the path where my parquet files are located?

Thanks for any advice.

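import org.apache.spark.sql.functions.lit
// inputPath can point to a single file or to a directory; for a directory,
// spark.read.parquet reads every parquet file under it
val inputPath = "<your path>"
val dataDF = spark.read.parquet(inputPath)
dataDF.printSchema()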
//    root
//    |-- _c0: string (nullable = true)
//    |-- _c1: string (nullable = true)
//    |-- _c2: string (nullable = true)
//    |-- _c3: string (nullable = true)
// lit(inputPath) adds the path as a constant string column (same value in every row)
val resDF = dataDF.withColumn("new_col", lit(inputPath))
resDF.printSchema()
//    root
//    |-- _c0: string (nullable = true)
//    |-- _c1: string (nullable = true)
//    |-- _c2: string (nullable = true)
//    |-- _c3: string (nullable = true)
//    |-- new_col: string (nullable = false)
resDF.schema
//    res2: org.apache.spark.sql.types.StructType = StructType(
//      StructField(_c0,StringType,true), 
//      StructField(_c1,StringType,true), 
//      StructField(_c2,StringType,true), 
//      StructField(_c3,StringType,true), 
//      StructField(new_col,StringType,false)
//    )
// resDF.show(false) prints the rows without truncating long column values
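
Note that lit(inputPath) puts the same string in every row. If the goal is a column derived from where each row actually came from, the following is a minimal sketch of two options that should work with a year=/month=/day= layout like the one in the question. It assumes Spark's partition discovery is enabled (the default), so the key=value folder names become columns automatically, and it uses input_file_name() for the full path of the source file of each row. The object name ParquetFolderExample, the helper readWithSource, the column name source_file, and the temporary local directory (a stand-in for hdfs://nn1home:8020/user/stg/ads) are hypothetical names chosen for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

object ParquetFolderExample {

  // Reads every parquet file under `path` and adds the source file of each row.
  // Partition folders such as year=2020/month=1/day=1 show up as columns.
  def readWithSource(spark: SparkSession, path: String): DataFrame =
    spark.read.parquet(path).withColumn("source_file", input_file_name())

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("read-parquet-folder")
      .master("local[*]") // assumption: local run; drop this when submitting to a cluster
      .getOrCreate()
    import spark.implicits._

    // Hypothetical local stand-in for the HDFS folder from the question
    val basePath = Files.createTempDirectory("ads").toString

    // Write a tiny partitioned dataset so the example is self-contained
    Seq(("a", 2020, 1, 1), ("b", 2020, 1, 2))
      .toDF("value", "year", "month", "day")
      .write.mode("overwrite")
      .partitionBy("year", "month", "day")
      .parquet(basePath)

    // Option 1: read the whole folder; year, month, day come from the path
    val allDF = readWithSource(spark, basePath)
    allDF.printSchema()
    allDF.show(false)

    // Option 2: read one sub-folder but keep the partition columns
    // by telling Spark where the partitioned layout starts
    val oneDayDF = spark.read
      .option("basePath", basePath)
      .parquet(s"$basePath/year=2020/month=1/day=1")
    oneDayDF.show(false)

    spark.stop()
  }
}
```

In spark-shell, where spark already exists, the core of it is just spark.read.parquet("hdfs://nn1home:8020/user/stg/ads").withColumn("source_file", input_file_name()). Without the basePath option, reading a single day folder directly would not include the year, month and day columns, because Spark only discovers partitions below the path it is given.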
