Evaluating new data with a saved Spark model



I have successfully built code that converts my data into libsvm files and trains a decision tree model with Spark's MLlib package. I used the Scala code from the 1.6.2 documentation, changing only the file names:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)
// Save and load model
model.save(sc, "target/tmp/myDecisionTreeRegressionModel")
val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeRegressionModel")

The code correctly prints the model's MSE and the learned tree. However, I am stuck on how to take sameModel and use it to evaluate new data. For example, if the libsvm file I used to train the model looks like this:

0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:9 16:19
0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:1 13:0 14:0 15:9 16:12
0 1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:6 16:7

how can I feed the trained model something like the following and have it predict the labels?

1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:9 16:19
1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:1 13:0 14:0 15:9 16:12
1:1.0 2:0.0 3:0.0 4:0.0 5:0.0 6:0.0 7:0.0 8:0.0 9:0.0 10:0.0 11:0.0 12:0 13:0 14:0 15:6 16:7
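One possible approach, sketched below: parse each unlabeled line of index:value pairs yourself and build the sparse vector that predict expects. The parseFeatureLine helper is hypothetical (not part of MLlib), and note that libsvm indices are 1-based while Spark vectors are 0-based.

```scala
// Hypothetical helper: turn one unlabeled libsvm-style line of
// "index:value" pairs into 0-based (index, value) pairs.
def parseFeatureLine(line: String): Array[(Int, Double)] =
  line.trim.split("\\s+").map { kv =>
    val Array(i, v) = kv.split(":")
    (i.toInt - 1, v.toDouble) // libsvm indices are 1-based
  }

val pairs = parseFeatureLine("1:1.0 2:0.0 3:0.0 12:0 15:9 16:19")
// With Spark on the classpath, these pairs feed Vectors.sparse and predict
// (16 must match the training data's feature count):
//   val vec = org.apache.spark.mllib.linalg.Vectors.sparse(16, pairs)
//   val label = sameModel.predict(vec) // a single Double
```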

Edit (August 31, 2017, 3:56 PM Eastern)

Following the suggestion below, I am trying the predict function, but it does not look like the code is working correctly:

val new_data = MLUtils.loadLibSVMFile(sc, "hdfs://.../new_data/*")
val labelsAndPredictions = new_data.map { point =>
  val prediction = sameModel.predict(point.features)
  (point.label, prediction)
}
labelsAndPredictions.take(10)

If I run this with a libsvm file that has '1' values as the labels (I am testing with ten rows from the file), they all come back as '1.0' from the labelsAndPredictions.take(10) command. If I give it '0' values, they all come back as '0.0', so it does not seem to be predicting anything correctly.

  1. Load the new data (a libsvm file similar to the one you used above)
  2. Provide the information about categorical features
  3. Call savedModel.predict(point.features) on each point in that data
  4. Collect the resulting predictions

The load method should return the model. Then call predict with an RDD[Vector] or with a single Vector.
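Concretely, with the paths from the question, that might look like the sketch below (assumed to run in the spark-shell where sc is the SparkContext; the model path is the one used with model.save earlier):

```scala
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils

// Reload the model persisted earlier with model.save
val savedModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeRegressionModel")

// Predict on a whole RDD[Vector] at once...
val newData = MLUtils.loadLibSVMFile(sc, "hdfs://.../new_data/*")
val predictions = savedModel.predict(newData.map(_.features))

// ...or on a single Vector:
val firstPrediction = savedModel.predict(newData.first().features)
```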

You can load an ML model from disk via a Pipeline:

import org.apache.spark.ml._
val pipeline = Pipeline.read.load("sample-pipeline")
scala> val stageCount = pipeline.getStages.size
stageCount: Int = 0
val pipelineModel = PipelineModel.read.load("sample-model")
scala> pipelineModel.stages

Once you have the pipeline, you can evaluate on a dataset:

val model = pipeline.fit(dataset)
val predictions = model.transform(dataset)

You have to use an appropriate Evaluator, e.g. a RegressionEvaluator. Evaluators work on datasets:

import org.apache.spark.ml.evaluation.RegressionEvaluator
val regEval = new RegressionEvaluator
println(regEval.explainParams)
regEval.evaluate(predictions)
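For intuition, the evaluator's default metric (rmse) can also be computed by hand from (label, prediction) pairs; a minimal plain-Scala sketch, not using the Spark API:

```scala
// Root-mean-squared error over (label, prediction) pairs --
// the same quantity RegressionEvaluator reports by default.
def rmse(pairs: Seq[(Double, Double)]): Double =
  math.sqrt(pairs.map { case (l, p) => math.pow(l - p, 2) }.sum / pairs.size)

val sample = Seq((1.0, 1.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0))
val err = rmse(sample) // sqrt(2/4), about 0.707
```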

upd If you are already working with HDFS, you can easily load/save the model:

One way to save the model to HDFS is as follows:

// persist model to HDFS
sc.parallelize(Seq(model), 1).saveAsObjectFile("hdfs:///user/root/sample-model")

The saved model can then be loaded as:

val linRegModel = sc.objectFile[LinearRegressionModel]("hdfs:///user/root/sample-model").first()
linRegModel.predict(Vectors.dense(11.0, 2.0, 2.0, 1.0, 2200.0))

Or, as in the example above, but loading from HDFS instead of a local file:

PipelineModel.read.load("hdfs:///user/root/sample-model")

Use HDFS to ship the file to a directory that all nodes in the cluster can see, then load it and predict in your code.
