Scala中的Spark Dataframe映射分区

有人有数据帧的mapPartitions函数的工作示例吗？

请注意：我不是在看RDD的例子。

更新：

MasterBuilder发布的示例，如果理论上可以，但实际上存在一些问题。请尝试获得像Json 这样的结构化数据流

val df = spark.load.json("/user/cloudera/json")
val newDF = df.mapPartitions(
iterator => {
val result = iterator.map(data=>{/* do some work with data */}).toList
//return transformed data
result.iterator
//now convert back to df
}
).toDF()

以以下错误结束：

<console>:28: error: Unable to find encoder for type stored in a Dataset.  
Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  
Support for serializing other types will be added in future releases.

有办法让它发挥作用吗？上面的代码出了什么问题？

您需要一个编码器。如果您的最终数据帧与输入数据帧具有相同的模式，那么它与一样简单

import org.apache.spark.sql.catalyst.encoders.RowEncoder
implicit val encoder = RowEncoder(df.schema)

如果没有，你需要"；重新定义"；架构并创建编码器。

有关编码器问题的更多信息，请参阅尝试将数据帧行映射到更新的行问题时的编码器错误

import sqlContext.implicits._
val newDF = df.mapPartitions(
iterator => {
val result = iterator.map(data=>{/* do some work with data */}).toList
//return transformed data
result.iterator
//now convert back to df
}
).toDF()

相关内容

最新更新

热门标签：