Convert an array of vectors to a DenseVector

I am running Spark 2.1 with Scala. I am trying to convert an array of vectors into a DenseVector.

Here is my DataFrame:

scala> df_transformed.printSchema()
root
 |-- id: long (nullable = true)
 |-- vals: vector (nullable = true)
 |-- hashValues: array (nullable = true)
 |    |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
|          id|                vals|          hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+

For example, I need to extract the value of the hashValues column for id 401310732094 into a DenseVector.

This can be done with a UDF:

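import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Flatten every vector in the array into one Array[Double], then wrap it in a dense vector
val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))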

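This overwrites the hashValues column with a new column that holds a single DenseVector per row, built by concatenating the vectors of the array in order.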
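
To see what the body of the UDF does to a single row, the same expression can be tried on a local sequence of vectors, without a DataFrame. This is only a sketch: the name `sampleHashValues` and the numbers in it are made up for illustration and are not taken from the data above.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}

// Hypothetical input: three one-element vectors, shaped like one hashValues cell
val sampleHashValues: Seq[Vector] = Seq(
  Vectors.dense(-1.0),
  Vectors.dense(2.0),
  Vectors.dense(3.0)
)

// Same expression as in the UDF: concatenate all elements, then build a dense vector
val flattened: Vector = Vectors.dense(sampleHashValues.flatMap(_.toArray).toArray)

println(flattened.size)
```

The resulting vector has as many elements as all input vectors combined, in their original order.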

Tested with a DataFrame that has the following schema:

root
 |-- id: integer (nullable = false)
 |-- hashValues: array (nullable = true)
 |    |-- element: vector (containsNull = true)

The result is:

root
 |-- id: integer (nullable = false)
 |-- hashValues: vector (nullable = true)
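
Once the column has the vector type, the value for one particular id can be pulled back to the driver as a DenseVector. The following is a minimal sketch under some assumptions: it builds a small made-up DataFrame with the test schema above (the ids 1 and 2 and their values are hypothetical), starts a local SparkSession, and uses the names `input`, `converted` and `dense` purely for illustration.

```scala
import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Local session for the sketch; in spark-shell a session named `spark` already exists
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("extract-dense-vector")
  .getOrCreate()
import spark.implicits._

// Made-up rows: an integer id and an array of vectors
val input = Seq(
  (1, Seq(Vectors.dense(1.0), Vectors.dense(2.0))),
  (2, Seq(Vectors.dense(3.0), Vectors.dense(4.0)))
).toDF("id", "hashValues")

// Same UDF as in the answer
val convertToVec = udf((array: Seq[Vector]) =>
  Vectors.dense(array.flatMap(_.toArray).toArray)
)
val converted = input.withColumn("hashValues", convertToVec($"hashValues"))

// Keep only the wanted id, fetch the single row, and read the vector cell.
// head() throws if no row matches, so the id is assumed to exist.
val dense: DenseVector = converted
  .filter($"id" === 1)
  .select("hashValues")
  .head()
  .getAs[Vector](0)
  .toDense

println(dense)
```

For the DataFrame in the question, the filter would presumably become `$"id" === 401310732094L`, since that id column is a long.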
