我有一个带有此结构的RDD
org.apache.spark.rdd.RDD[(Long, org.apache.spark.mllib.linalg.Vector)]
在这里,RDD的每一行都包含索引Long
和一个向量org.apache.spark.mllib.linalg.Vector
。我想将以下函数应用于每个向量,都存在于每一行中。
函数是:sum(vi * ln(vi)),其中vi = vector的ITH组件。
请指导我如何将此功能应用于Scala中提到的结构的RDD。
一个示例行看起来像这样:
Array[(Long, org.apache.spark.mllib.linalg.Vector)] =
Array((0,[0.024866109194373365,0.025451635045582396,0.024940244042347803,
0.025318245591768037,0.026531498776299952,0.02335951025503321,
0.02388238099930112,0.023397342214386187,0.024965559145567116,
0.023650490684903713,0.023343404489401316,0.024368157919182634,
0.02526665811061871,0.025846888476461573,0.025874255477319974))
我们可以尝试将您的Vector
列转换为类型Array
,因此我们可以将x * log(x)
映射到每个元素,最后sum
使用第二个mapValues
调用:
Array
。 rdd.mapValues(_.toArray.map(x => scala.math.log(x) * x)).mapValues(_.sum)