我正在使用Scala运行Spark 2.1。我正在尝试将向量数组转换为DenseVector
.
这是我的数据帧:
scala> df_transformed.printSchema()
root
|-- id: long (nullable = true)
|-- vals: vector (nullable = true)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
scala> df_transformed.show()
+------------+--------------------+--------------------+
| id| vals| hashValues|
+------------+--------------------+--------------------+
|401310732094|[-0.37154,-0.1159...|[[-949518.0], [47...|
|292125586474|[-0.30407,0.35437...|[[-764013.0], [31...|
|362051108485|[-0.36748,0.05738...|[[-688834.0], [18...|
|222480119030|[-0.2509,0.55574,...|[[-1167047.0], [2...|
|182270925238|[0.32288,-0.60789...|[[-836660.0], [97...|
+------------+--------------------+--------------------+
例如,我需要将hashValues
列的值提取到 id 401310732094 的DenseVector
中。
这可以通过UDF
来完成:
import spark.implicits._
val convertToVec = udf((array: Seq[Vector]) =>
Vectors.dense(array.flatMap(_.toArray).toArray)
)
val df = df_transformed.withColumn("hashValues", convertToVec($"hashValues"))
这将用包含DenseVector
的新列覆盖hashValues
列。
使用具有以下架构的数据帧进行测试:
root
|-- id: integer (nullable = false)
|-- hashValues: array (nullable = true)
| |-- element: vector (containsNull = true)
结果是:
root
|-- id: integer (nullable = false)
|-- hashValues: vector (nullable = true)