Reading XML with spark-xml, including the file name

I am trying to read all the files in a folder with spark-xml, like this:

val df = sparkSession
  .read
  .format("com.databricks.spark.xml")
  .schema(customSchema)
  .option("rootTag", "Transactions")
  .option("rowTag", "Transaction")
  .load("/Users/spark/Desktop/sample")

The sample folder contains X XML files.

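Based on the customSchema I provide, each file becomes 1..N rows, depending on the number of Transaction tags it contains. What I want, though, is to also include the XML file name as an extra column on every record.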

I looked through the options documented in the spark-xml GitHub repository, but could not find one that does this.

Any suggestions, or is there perhaps a different way to achieve this?

Thanks,

Use the SQL function input_file_name. In your case it would be something like this:

import org.apache.spark.sql.functions._
val dfWithFile = df.withColumn("file", input_file_name())

You can use the input_file_name() function and chain it with withColumn directly after the load call when reading.

val df = sparkSession
  .read
  .format("com.databricks.spark.xml")
  .schema(customSchema)
  .option("rootTag", "Transactions")
  .option("rowTag", "Transaction")
  .load("/Users/spark/Desktop/sample")
  .withColumn("FileName",input_file_name())
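
Note that input_file_name() returns the full path of the source file, not just its name. If only the bare file name is wanted, one option is to keep the part after the last "/" using substring_index. The following is a minimal sketch: the helper name withSourceFile and the column name FilePath are illustrative assumptions, not part of the original answers.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{input_file_name, substring_index}

// Hypothetical helper: adds the full source path and the bare file name
// as two extra columns on a DataFrame that was loaded from files.
def withSourceFile(df: DataFrame): DataFrame =
  df
    // Full path of the file each row was read from.
    .withColumn("FilePath", input_file_name())
    // A negative count keeps everything after the last "/" delimiter.
    .withColumn("FileName", substring_index(input_file_name(), "/", -1))
```

It could be applied to the DataFrame from the question as withSourceFile(df), in place of the single withColumn call.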
