Interpreting rawPrediction from Spark ML LinearSVC



I am using Spark ML's LinearSVC in a binary classification model. The transform method creates two columns, prediction and rawPrediction. Spark's documentation does not offer any way to interpret the rawPrediction column for this particular classifier. This question has been asked and answered for other classifiers, but not specifically for LinearSVC.

The relevant column from my predictions dataframe:

+------------------------------------------+ 
|rawPrediction                             | 
+------------------------------------------+ 
|[0.8553257800650063,-0.8553257800650063]  | 
|[0.4230977574196645,-0.4230977574196645]  | 
|[0.49814263303537865,-0.49814263303537865]| 
|[0.9506355050332026,-0.9506355050332026]  | 
|[0.5826887000450813,-0.5826887000450813]  | 
|[1.057222808292026,-1.057222808292026]    | 
|[0.5744214192446275,-0.5744214192446275]  | 
|[0.8738081933835614,-0.8738081933835614]  | 
|[1.418173816502859,-1.418173816502859]    | 
|[1.0854125533426737,-1.0854125533426737]  | 
+------------------------------------------+

Clearly this is not simply the probability of belonging to each class. What is it?

Edit: Since the input code has been requested, here is a model built on a subset of the features in the original dataset. Fitting any data with Spark's LinearSVC will generate this column.

import org.apache.spark.ml.classification.LinearSVC
import org.apache.spark.ml.feature.VectorAssembler

var df = sqlContext
  .read
  .format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/FileStore/tables/full_frame_20180716.csv")

var assembler = new VectorAssembler()
  .setInputCols(Array("oy_length", "ah_length", "ey_length", "vay_length", "oh_length",
    "longest_word_length", "total_words", "repeated_exact_words",
    "repeated_bigrams", "repeated_lemmatized_words",
    "repeated_lemma_bigrams"))
  .setOutputCol("features")
df = assembler.transform(df)
var Array(train, test) = df.randomSplit(Array(0.8, 0.2), 42)
var supvec = new LinearSVC()
  .setLabelCol("written_before_2004")
  .setMaxIter(10)
  .setRegParam(0.001)
var supvecModel = supvec.fit(train)
var predictions = supvecModel.transform(test)
predictions.select("rawPrediction").show(20, false)

Output:

+----------------------------------------+ 
|rawPrediction | 
+----------------------------------------+ 
|[1.1502868455791242,-1.1502868455791242]| 
|[0.853488887006264,-0.853488887006264] | 
|[0.8064994501574174,-0.8064994501574174]| 
|[0.7919862003563363,-0.7919862003563363]| 
|[0.847418035176922,-0.847418035176922] | 
|[0.9157433788236442,-0.9157433788236442]| 
|[1.6290888181913814,-1.6290888181913814]| 
|[0.9402461917731906,-0.9402461917731906]| 
|[0.9744052798627367,-0.9744052798627367]| 
|[0.787542624053347,-0.787542624053347] | 
|[0.8750602657901001,-0.8750602657901001]| 
|[0.7949414037722276,-0.7949414037722276]| 
|[0.9163545832998052,-0.9163545832998052]| 
|[0.9875454213431247,-0.9875454213431247]| 
|[0.9193015302646135,-0.9193015302646135]| 
|[0.9828623328048487,-0.9828623328048487]| 
|[0.9175976004208621,-0.9175976004208621]| 
|[0.9608750388820302,-0.9608750388820302]| 
|[1.029326217566756,-1.029326217566756] | 
|[1.0190290910146256,-1.0190290910146256]|
+----------------------------------------+
only showing top 20 rows

It is (-margin, margin), as seen in Spark's LinearSVCModel source:

override protected def predictRaw(features: Vector): Vector = {
val m = margin(features)
Vectors.dense(-m, m)
}

As arpad mentioned, it is the margin.

The margin is:

margin = coefficients * features + intercept
or
y = w * x + b

If you divide the margin by the norm of the coefficients, you get the distance of each data point to the hyperplane.
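As a minimal sketch, the arithmetic above can be reproduced by hand. The coefficient, intercept, and feature values below are made-up stand-ins (in practice they would come from supvecModel.coefficients and supvecModel.intercept); only the formulas mirror the source.

```scala
// Hypothetical values standing in for a fitted LinearSVCModel:
val coefficients = Array(0.5, -1.2, 0.3) // stand-in for model.coefficients
val intercept    = 0.1                   // stand-in for model.intercept
val features     = Array(1.0, 2.0, 3.0)  // one feature vector

// margin = coefficients . features + intercept
val margin = coefficients.zip(features).map { case (c, f) => c * f }.sum + intercept

// rawPrediction is [-margin, margin]; prediction = 1.0 iff margin > 0
val rawPrediction = Array(-margin, margin)

// Dividing by the norm of the coefficients gives the signed
// distance of the point to the separating hyperplane:
val coefNorm = math.sqrt(coefficients.map(c => c * c).sum)
val distance = margin / coefNorm
```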
