Adding custom fields to Spark ML LabeledPoint



How can I add some custom fields (e.g. a user ID) to the prediction results?

        List<org.apache.spark.mllib.regression.LabeledPoint> localTesting = ... ;//
        // I want to add some identifier to each LabeledPoint
        DataFrame localTestDF = jsql.createDataFrame(jsc.parallelize(studyData.localTesting), LabeledPoint.class);
        DataFrame predictions = model.transform(localTestDF);
        Row[] collect = predictions.select("label", "probability", "prediction").collect();
        for (Row r : collect) {
            // and I want to get the identifier back here,
            // so that I can save it to the database.
            int userNo = Integer.parseInt(r.get(0).toString());
            double prob = Double.parseDouble(r.get(1).toString());
            int prediction = Integer.parseInt(r.get(2).toString());
            log.debug(userNo + "," + prob + ", " + prediction);
        }

But when I use this class for localTesting instead of LabeledPoint,

class NoLabeledPoint extends LabeledPoint implements Serializable {
    private static final long serialVersionUID = -2488661810406135403L;
    int userNo;
    public NoLabeledPoint(double label, Vector features) {
        super(label, features);
    }
    public int getUserNo() {
        return userNo;
    }
    public void setUserNo(int userNo) {
        this.userNo = userNo;
    }
}
        List<NoLabeledPoint> localTesting = ... ;// each user's number is set in the userNo field
        // I want to add some identifier to each LabeledPoint
        DataFrame localTestDF = jsql.createDataFrame(jsc.parallelize(studyData.localTesting), LabeledPoint.class);
        DataFrame predictions = model.transform(localTestDF);
        Row[] collect = predictions.select("userNo", "probability", "prediction").collect();
        for (Row r : collect) {
            // and I want to get the identifier back here,
            // so that I can save it to the database.
            int userNo = Integer.parseInt(r.get(0).toString());
            double prob = Double.parseDouble(r.get(1).toString());
            int prediction = Integer.parseInt(r.get(2).toString());
            log.debug(userNo + "," + prob + ", " + prediction);
        }

an exception is thrown:

org.apache.spark.sql.AnalysisException: cannot resolve 'userNo' given input columns rawPrediction, probability, features, label, prediction;
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:63)
        at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:52)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:286)
        at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51)

I mean that I want to get not only the prediction data (features, label, probability, ...) but also some custom fields, such as userNo or user_id, from the result: predictions.select("......")

Solved. One line needed to be fixed, from

            DataFrame localTestDF = jsql.createDataFrame(jsc.parallelize(studyData.localTesting), LabeledPoint.class);

to

            DataFrame localTestDF = jsql.createDataFrame(jsc.parallelize(studyData.localTesting), NoLabeledPoint.class);

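Since you are not using the low-level MLlib API, there is no need to use LabeledPoint at all. Once you have created a DataFrame, all you get is a Row with some values*, and all that matters is that the types and column names match the parameters in the pipeline.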

In Scala you can use any case class:

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

case class LabeledPointWithMeta(userNo: String, label: Double, features: Vector)
val rdd: RDD[LabeledPointWithMeta] = ???
// toDF requires `import sqlContext.implicits._` to be in scope
val df = rdd.toDF

To be able to use it, you should probably add the @BeanInfo annotation:

import scala.beans.BeanInfo
@BeanInfo
case class LabeledPointWithMeta(...)

Based on the Spark SQL and DataFrame Guide, it looks like in plain Java you can do something like this**:

import java.io.Serializable;
import org.apache.spark.mllib.linalg.Vector;

// JavaBean: each getter/setter pair becomes a DataFrame column
public static class LabeledPointWithMeta implements Serializable {
  private int userNo;
  private double label;
  // named "features" so the column matches the pipeline's default features column
  private Vector features;
  public int getUserNo() {
    return userNo;
  }
  public void setUserNo(int userNo) {
    this.userNo = userNo;
  }
  public double getLabel() {
    return label;
  }
  public void setLabel(double label) {
    this.label = label;
  }
  public Vector getFeatures() {
    return features;
  }
  public void setFeatures(Vector features) {
    this.features = features;
  }
}

Then:

JavaRDD<LabeledPointWithMeta> myPoints = ...;
DataFrame df = sqlContext.createDataFrame(myPoints, LabeledPointWithMeta.class);

I think a simple change in your code should work as well:

DataFrame localTestDF = jsql.createDataFrame(
    jsc.parallelize(studyData.localTesting),
    NoLabeledPoint.class
); 

This won't help you if you want to use MLlib, but that part can be handled easily with simple RDD transformations such as zip.
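To illustrate the zip idea, here is a minimal sketch in plain Java, with ordinary lists standing in for RDDs so that it runs without Spark. The names `ZipSketch`, `MODEL` and `predictAll` are made up for this example, and the "model" is just a placeholder rule, not a trained MLlib model. The point is only that the identifiers are kept in a collection parallel to the features and paired with the predictions by position afterwards. With real RDDs, zip expects both sides to have the same number of partitions and the same number of elements in each partition, which is the case when the predictions were produced by a map over the same source RDD.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ZipSketch {
    // Hypothetical stand-in for a trained model:
    // predicts 1.0 when the first feature is positive, otherwise 0.0.
    static final Function<double[], Double> MODEL = f -> f[0] > 0 ? 1.0 : 0.0;

    // Apply the placeholder model to every feature array, preserving order.
    static List<Double> predictAll(List<double[]> features) {
        List<Double> predictions = new ArrayList<>();
        for (double[] f : features) {
            predictions.add(MODEL.apply(f));
        }
        return predictions;
    }

    // Pair the i-th identifier with the i-th prediction (positional zip).
    static List<Map.Entry<Integer, Double>> zip(List<Integer> ids, List<Double> predictions) {
        if (ids.size() != predictions.size()) {
            throw new IllegalArgumentException("ids and predictions must have the same size");
        }
        List<Map.Entry<Integer, Double>> pairs = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            pairs.add(new SimpleEntry<>(ids.get(i), predictions.get(i)));
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Identifiers and features are kept in the same order.
        List<Integer> userNos = Arrays.asList(101, 102, 103);
        List<double[]> features = Arrays.asList(
            new double[]{0.5}, new double[]{-1.2}, new double[]{2.0});

        List<Double> predictions = predictAll(features);
        for (Map.Entry<Integer, Double> e : zip(userNos, predictions)) {
            System.out.println(e.getKey() + "," + e.getValue());
        }
    }
}
```

In Spark the same shape would be an RDD of identifiers zipped with the RDD of predictions, both derived from the same source RDD.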


* There is some metadata as well, but you won't get that from LabeledPoint.

** I haven't tested the code above, so it may contain some errors.
