Spark Streaming: "Task not serializable" when calling predict



I am trying to use a model to make predictions in a Spark Streaming program, but I get an error when doing so: Task not serializable.

Code:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// the model is loaded on the driver
val model = sc.objectFile[DecisionTreeModel]("DecisionTreeModel").first()
val parsedData = reducedData.map { line =>
  val arr = Array(line._2._1, line._2._2, line._2._3, line._2._4, line._2._5, line._2._6,
    line._2._7, line._2._8, line._2._9, line._2._10, line._2._11)
  val vector = LabeledPoint(line._2._4, Vectors.dense(arr))
  model.predict(vector.features)
}

Pasted error:

scala> val parsedData = reducedData.map { line =>
     |     val arr = Array(line._2._1,line._2._2,line._2._3,line._2._4,line._2._5,line._2._6,line._2._7,line._2._8,line._2._9,line._2._10,line._2._11)
     |     val vector=LabeledPoint(line._2._4, Vectors.dense(arr))
     |     model.predict(vector.features)
     | }
org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2030)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
    at org.apache.spark.streaming.dstream.DStream$$anonfun$map$1.apply(DStream.scala:528)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:709)
    at .......

How can I solve this problem?

Thanks!

Refer to this link: https://databricks.gitbooks.io/databricks-spark-knowledge-base/content/troubleshooting/javaionotserializableexception.html

In your case, "model" is instantiated in the driver and used inside the map, which causes the object to be sent over the network from the driver to the executors, so it has to be serializable. If you cannot make the model serializable, avoid serializing it by instantiating it inside the map instead. You may also want to control how often the object is created on the executors: once per row (the default), once per task (i.e., thread), or once per executor (i.e., JVM), as sketched below.
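For illustration, here is a minimal sketch of the "once per task" and "once per executor" options. It assumes a hypothetical loadModelLocally() helper and a "DecisionTreeModel.bin" file present on every executor's local filesystem (e.g. shipped with spark-submit --files); sc.objectFile cannot be used here, because the SparkContext exists only on the driver:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.tree.model.DecisionTreeModel

// Hypothetical loader: deserializes the model from a file assumed to exist on
// each executor's local filesystem. It deliberately avoids the SparkContext.
def loadModelLocally(): DecisionTreeModel = {
  val in = new java.io.ObjectInputStream(new java.io.FileInputStream("DecisionTreeModel.bin"))
  try in.readObject().asInstanceOf[DecisionTreeModel] finally in.close()
}

// Once per task (i.e., thread): build the model inside mapPartitions, so the
// closure captures nothing from the driver and the model is created once per partition.
val perTask = reducedData.mapPartitions { iter =>
  val model = loadModelLocally()
  iter.map { line =>
    val arr = Array(line._2._1, line._2._2, line._2._3, line._2._4, line._2._5, line._2._6,
      line._2._7, line._2._8, line._2._9, line._2._10, line._2._11)
    model.predict(Vectors.dense(arr))
  }
}

// Once per executor (i.e., JVM): a lazy val in a Scala object is initialized at
// most once per JVM; executors resolve the object by name instead of serializing it.
object ModelHolder {
  lazy val model: DecisionTreeModel = loadModelLocally()
}
val perExecutor = reducedData.map { line =>
  val arr = Array(line._2._1, line._2._2, line._2._3, line._2._4, line._2._5, line._2._6,
    line._2._7, line._2._8, line._2._9, line._2._10, line._2._11)
  ModelHolder.model.predict(Vectors.dense(arr))
}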

Finally, I don't think you can have a single global "model" object whose mutations would be visible across executors, in case that is what you are looking for (regardless of whether you make it serializable): closures are shipped one way, so changes made on an executor are never propagated back to the driver or to other executors.
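As a side note, if the model does turn out to be serializable, a broadcast variable is the standard Spark way to share one read-only copy per executor instead of re-serializing it with every task; a minimal sketch:

// Broadcast ships the (serializable) model to each executor once and exposes it
// read-only; the map closure captures only the lightweight broadcast handle.
val modelBroadcast = sc.broadcast(model)
val predictions = reducedData.map { line =>
  val arr = Array(line._2._1, line._2._2, line._2._3, line._2._4, line._2._5, line._2._6,
    line._2._7, line._2._8, line._2._9, line._2._10, line._2._11)
  modelBroadcast.value.predict(Vectors.dense(arr))
}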
