Apache Spark: reading multiple text files in one run

I can successfully load a text file into a DataFrame with the following Apache Spark Scala code:

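import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id}

// Read one text file (one row per line), then tag each row with its source file and an id
val df = spark.read.text("first.txt")
  .withColumn("fileName", input_file_name())
  .withColumn("unique_id", monotonically_increasing_id())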

Is there a way to supply multiple files in a single run? Something like this:

val df = spark.read.text("first.txt,second.txt,someother.txt")
  .withColumn("fileName", input_file_name())
  .withColumn("unique_id", monotonically_increasing_id())

As written, this code does not work and fails with the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Path does not exist: file:first.txt,second.txt,someother.txt;
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:558)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary$1.apply(DataSource.scala:545)

What is the correct way to load multiple text files?

The function spark.read.text() takes a varargs parameter. From the docs:

def text(paths: String*): DataFrame

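This means that to read multiple files, you pass each path to the function as its own argument, separated by commas. A single string such as "first.txt,second.txt,someother.txt" is treated as one path, which is why the attempt in the question fails with "Path does not exist". The correct call is: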

val df = spark.read.text("first.txt", "second.txt", "someother.txt")
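
If the file names arrive as a collection or as one comma-separated string (as in the question), they can be expanded into the varargs parameter with Scala's `: _*` syntax. The following is only a minimal sketch under a few assumptions: the object name ReadManyTextFiles, the helper splitPaths, and the local[*] master are hypothetical choices for illustration, and the three file names are the ones from the question and must exist relative to the working directory.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{input_file_name, monotonically_increasing_id}

object ReadManyTextFiles {

  // Hypothetical helper: turn "a.txt, b.txt" into Seq("a.txt", "b.txt"),
  // trimming whitespace and dropping empty entries.
  def splitPaths(joined: String): Seq[String] =
    joined.split(",").map(_.trim).filter(_.nonEmpty).toSeq

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, just for trying the example out.
    val spark = SparkSession.builder()
      .appName("read-many-text-files")
      .master("local[*]")
      .getOrCreate()

    val paths = splitPaths("first.txt,second.txt,someother.txt")

    // `paths: _*` expands the Seq into the varargs parameter of text(paths: String*)
    val df = spark.read.text(paths: _*)
      .withColumn("fileName", input_file_name())
      .withColumn("unique_id", monotonically_increasing_id())

    df.show(truncate = false)
    spark.stop()
  }
}
```

The fileName column then shows which input file each row came from, so rows from the different files stay distinguishable in the combined DataFrame.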
