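I am trying to map the predicted probabilities back to the original labels in a Spark classification prediction. My input data is for a multiclass classifier with the labels red, green, and blue.
Input data frame: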
+-----+---+---+---+---+---+---+---+---+---+----+----+----+----+
| _c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|
+-----+---+---+---+---+---+---+---+---+---+----+----+----+----+
| red| 0| 0| 0| 1| 0| 0| 0| 2| 3| 2| 2| 0| 5|
|green| 5| 6| 0| 14| 0| 5| 0| 95| 2| 120| 0| 0| 9|
|green| 6| 1| 0| 3| 0| 4| 0| 21| 22| 11| 0| 0| 23|
| red| 0| 1| 0| 1| 0| 4| 0| 1| 4| 2| 0| 0| 5|
|green| 37| 9| 0| 19| 0| 31| 0| 87| 9| 108| 0| 0| 170|
+-----+---+---+---+---+---+---+---+---+---+----+----+----+----+
only showing top 5 rows
I index the label column with StringIndexer and build the feature vector from the feature columns with VectorAssembler.
Parsed data frame:
+-----+--------------------+
|label| features|
+-----+--------------------+
| 1.0|(13,[3,7,8,9,10,1...|
| 0.0|[5.0,6.0,0.0,14.0...|
| 0.0|[6.0,1.0,0.0,3.0,...|
| 1.0|(13,[1,3,5,7,8,9,...|
| 0.0|[37.0,9.0,0.0,19....|
+-----+--------------------+
only showing top 5 rows
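The indexing and assembly step described above can be sketched as follows. This is a minimal sketch, not the asker's actual code: the three-row data set, the reduced column list (`_c1` to `_c3`), and the names `input`, `labelIndexer`, and `assembler` are assumptions made for illustration. It is written script-style, so it can be pasted into spark-shell or run as a Scala script with Spark on the classpath.

```scala
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("index-and-assemble-sketch")
  .getOrCreate()
import spark.implicits._

// Made-up stand-in for the input data frame: label in _c0, numeric features after it.
val input = Seq(
  ("red", 0, 0, 1),
  ("green", 5, 6, 14),
  ("green", 6, 1, 3)
).toDF("_c0", "_c1", "_c2", "_c3")

// Fit the indexer so the label order can be read back later.
// By default the most frequent label gets index 0.0.
val labelIndexer = new StringIndexer()
  .setInputCol("_c0")
  .setOutputCol("label")
  .fit(input)

// Combine the numeric feature columns into a single vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("_c1", "_c2", "_c3"))
  .setOutputCol("features")

val parsed = assembler
  .transform(labelIndexer.transform(input))
  .select("label", "features")

parsed.show(false)
```

Keeping a reference to the fitted `labelIndexer` matters here, because its `labels` array is what ties each class index back to its original label.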
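A random forest classification model is trained on this data. At query time, I supply the feature columns to predict the label along with its probability.
Query data frame: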
+---+---+---+---+---+---+---+---+---+---+----+----+----+
|_c0|_c1|_c2|_c3|_c4|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|
+---+---+---+---+---+---+---+---+---+---+----+----+----+
| 11| 11| 0| 23| 0| 7| 2| 70| 81| 76| 7| 0| 23|
| 4| 0| 0| 0| 0| 0| 2| 2| 3| 2| 7| 0| 2|
+---+---+---+---+---+---+---+---+---+---+----+----+----+
Parsed query data frame:
+--------------------+--------------------+
| queryValue| features|
+--------------------+--------------------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...|
+--------------------+--------------------+
Raw predictions from RFCModel:
+--------------------+--------------------+--------------------+----------+
| queryValue| features| probability|prediction|
+--------------------+--------------------+--------------------+----------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...| [0.67, 0.32]| 0.0|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...| [0.05, 0.94]| 1.0|
+--------------------+--------------------+--------------------+----------+
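A table like the raw predictions above could come from a training and query step along these lines. This is only a sketch under assumptions: the four training rows, the three-dimensional feature vectors, the hyperparameters (`setNumTrees(10)`, `setSeed(42L)`), and the names `parsed`, `rfcModel`, and `query` are invented for illustration and are not taken from the question.

```scala
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rf-sketch")
  .getOrCreate()

// Made-up training data in the (label, features) shape of the parsed data frame.
val parsed = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0, 0.0, 1.0)),
  (0.0, Vectors.dense(5.0, 6.0, 14.0)),
  (0.0, Vectors.dense(6.0, 1.0, 3.0)),
  (1.0, Vectors.dense(0.0, 1.0, 1.0))
)).toDF("label", "features")

// Train the random forest; tree count and seed are arbitrary choices for the sketch.
val rfcModel = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(10)
  .setSeed(42L)
  .fit(parsed)

// A single made-up query row that contains features only.
val query = spark.createDataFrame(
  Seq(Tuple1(Vectors.dense(11.0, 11.0, 23.0)))
).toDF("features")

// transform() appends rawPrediction, probability and prediction columns.
val predictions = rfcModel.transform(query)
predictions.select("features", "probability", "prediction").show(false)
```

The `probability` column holds one entry per class index, so its length equals `rfcModel.numClasses`.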
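In the raw predictions, the probability column is a vector of doubles that holds the probability at each class index. For example, if a row of the probability column is [0.67, 0.32], class 0.0 has probability 0.67 and class 1.0 has probability 0.32. The probability column is only meaningful while the labels are 0, 1, 2, and so on. Once I use IndexToString to map the prediction index back to the original label, the probability column no longer tells me which probability belongs to which label.
Indexed data frame: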
+--------------------+--------------------+--------------------+----------+
| queryValue| features| probability|prediction|
+--------------------+--------------------+--------------------+----------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...| [0.67, 0.32]| green|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...| [0.05, 0.94]| red|
+--------------------+--------------------+--------------------+----------+
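For reference, the IndexToString step that yields a table like the one above might look like the sketch below. The label order `Array("green", "red")` and the two-row `predictions` frame are assumptions for illustration; in a real pipeline the labels would come from the fitted StringIndexerModel.

```scala
import org.apache.spark.ml.feature.IndexToString
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("index-to-string-sketch")
  .getOrCreate()
import spark.implicits._

// Assumed label order, as a fitted StringIndexerModel would expose it via `labels`.
val labels = Array("green", "red")

// Made-up prediction indices standing in for the model output.
val predictions = Seq(0.0, 1.0).toDF("prediction")

// IndexToString rewrites only the column it is pointed at.
val converter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labels)

val indexed = converter.transform(predictions)
indexed.show(false)
```

IndexToString only converts the single index column it is given, which is why the probability vector stays keyed by position rather than by label.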
I want to map the probability column back to the labels as well, like this:
+--------------------+--------------------+--------------------------+----------+
| queryValue| features| probability |prediction|
+--------------------+--------------------+--------------------------+----------+
|11,11,0,23,0,7,2,...|[11.0,11.0,0.0,23...|{"red":0.32,"green":0.67} | green|
|4,0,0,0,0,0,2,2,3...|(13,[0,6,7,8,9,10...|{"red":0.94,"green":0.05} | red|
+--------------------+--------------------+--------------------------+----------+
Right now I index the probability column by converting the data frame to a list. Is there a feature transformer in Spark that can do this?
Try the following approach to solve this problem.
I used the Iris data to work through it.
Sample input (first 5 rows):
+------------+-----------+------------+-----------+-----------+
|sepal_length|sepal_width|petal_length|petal_width| label|
+------------+-----------+------------+-----------+-----------+
| 5.1| 3.5| 1.4| 0.2|Iris-setosa|
| 4.9| 3.0| 1.4| 0.2|Iris-setosa|
| 4.7| 3.2| 1.3| 0.2|Iris-setosa|
| 4.6| 3.1| 1.5| 0.2|Iris-setosa|
| 5.0| 3.6| 1.4| 0.2|Iris-setosa|
+------------+-----------+------------+-----------+-----------+
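Capture the labels with their indices from the StringIndexerModel
You mentioned: "I index the label column with StringIndexer and build the feature vector from the feature columns with VectorAssembler."
We will use that StringIndexerModel to get a Map[index, label]. Note that the variable is named labelToIndex, although it maps each index to its label.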
// in my case, StringIndexerModel is referenced as labelIndexer
val labelToIndex = labelIndexer.labels.zipWithIndex.map(_.swap).toMap
println(labelToIndex)
Result:
Map(0 -> Iris-setosa, 1 -> Iris-versicolor, 2 -> Iris-virginica)
Use this map to generate the probability JSON
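import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, to_json, udf}

// Pair each probability with its class index, then look up the original label.
// Mapping to (label -> prob) before calling toMap keeps classes that happen to
// share the same probability from collapsing into a single entry.
val mapToLabel = udf((vector: Vector) =>
  vector.toArray.zipWithIndex.map {
    case (prob, index) => labelToIndex(index) -> prob
  }.toMap
)
predictions.select(
  col("features"),
  col("probability"),
  to_json(mapToLabel(col("probability"))).as("probability_json"),
  col("prediction"),
  col("predictedLabel"))
  .show(5, false)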
Result:
+-------------------------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+----------+--------------+
|features |probability |probability_json |prediction|predictedLabel|
+-------------------------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+----------+--------------+
|(123,[0,37,82,101],[1.0,1.0,1.0,1.0])|[0.7094347002635046,0.174338768115942,0.11622653162055337] |{"Iris-setosa":0.7094347002635046,"Iris-versicolor":0.174338768115942,"Iris-virginica":0.11622653162055337} |0.0 |Iris-setosa |
|(123,[0,39,58,101],[1.0,1.0,1.0,1.0])|[0.7867074275362319,0.12433876811594202,0.0889538043478261] |{"Iris-setosa":0.7867074275362319,"Iris-versicolor":0.12433876811594202,"Iris-virginica":0.0889538043478261} |0.0 |Iris-setosa |
|(123,[0,39,62,107],[1.0,1.0,1.0,1.0])|[0.5159492704509036,0.2794443583750028,0.2046063711740936] |{"Iris-setosa":0.5159492704509036,"Iris-versicolor":0.2794443583750028,"Iris-virginica":0.2046063711740936} |0.0 |Iris-setosa |
|(123,[2,39,58,101],[1.0,1.0,1.0,1.0])|[0.7822379507920459,0.12164981462756994,0.09611223458038423]|{"Iris-setosa":0.7822379507920459,"Iris-versicolor":0.12164981462756994,"Iris-virginica":0.09611223458038423}|0.0 |Iris-setosa |
|(123,[2,43,62,101],[1.0,1.0,1.0,1.0])|[0.7049652235193186,0.17164981462756992,0.1233849618531115] |{"Iris-setosa":0.7049652235193186,"Iris-versicolor":0.17164981462756992,"Iris-virginica":0.1233849618531115} |0.0 |Iris-setosa |
+-------------------------------------+------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------+----------+--------------+
only showing top 5 rows
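As a possible alternative to the UDF, the same JSON column can be built from built-in functions. This is a sketch under assumptions, not part of the original answer: it relies on `vector_to_array` from `org.apache.spark.ml.functions`, which requires Spark 3.0 or later, and on `map_from_arrays` (Spark 2.4 or later). The label array and the two-row `predictions` frame are made up for illustration.

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, lit, map_from_arrays, to_json}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("probability-json-sketch")
  .getOrCreate()

// Assumed label order, as exposed by the fitted StringIndexerModel's `labels`.
val labels = Array("green", "red")

// Made-up predictions standing in for the model output.
val predictions = spark.createDataFrame(Seq(
  (Vectors.dense(0.67, 0.33), 0.0),
  (Vectors.dense(0.05, 0.95), 1.0)
)).toDF("probability", "prediction")

// Literal array column holding the labels in class-index order.
val labelKeys = array(labels.map(lit): _*)

// Zip label i with probability i into a map column, then serialize it to JSON.
val withJson = predictions.withColumn(
  "probability_json",
  to_json(map_from_arrays(labelKeys, vector_to_array(col("probability"))))
)

withJson.show(false)
```

Because the keys are the labels themselves, equal probabilities for two classes stay as two separate entries in the resulting map.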