How to convert a Spark DataFrame column from Array[Int] to linalg.Vector

I have a DataFrame df that looks like this:

+--------+--------------------+
| user_id|        is_following|
+--------+--------------------+
|       1|[2, 3, 4, 5, 6, 7]  |
|       2|[20, 30, 40, 50]    |
+--------+--------------------+

I can confirm that it has the following schema:

root
|-- user_id: integer (nullable = true)
|-- is_following: array (nullable = true)
|    |-- element: integer (containsNull = true)

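I want to use Spark's ML routines (such as LDA) to do some machine learning on this, which requires converting the is_following column to a linalg.Vector (not a Scala Vector). When I try to do this with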

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val assembler = new VectorAssembler().setInputCols(Array("is_following")).setOutputCol("features")
val output = assembler.transform(df)

I get the following error:

java.lang.IllegalArgumentException: Data type ArrayType(IntegerType,true) is not supported.

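If I am interpreting this correctly, I need to convert the element type here from integer to something else (Double? String?).

My question is: what is the best way to convert this array into something that will vectorize properly for the ML pipeline?

EDIT: If it helps, I don't have to structure the DataFrame this way. I could instead have it look like this: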

+--------+------------+
| user_id|is_following|
+--------+------------+
|       1|           2|
|       1|           3|
|       1|           4|
|       1|           5|
|       1|           6|
|       1|           7|
|       2|          20|
|     ...|         ...|
+--------+------------+

A simple way to convert the array to a linalg.Vector and convert the integers to doubles at the same time is to use a UDF.

Using your DataFrame:

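import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
// Recreate the example DataFrame from the question
val df = spark.createDataFrame(Seq((1, Array(2,3,4,5,6,7)), (2, Array(20,30,40,50))))
  .toDF("user_id", "is_following")
// Convert each Int to a Double and wrap the result in a dense ml.linalg vector
val convertToVector = udf((array: Seq[Int]) => {
  Vectors.dense(array.map(_.toDouble).toArray)
})
val df2 = df.withColumn("is_following", convertToVector($"is_following"))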

spark.implicits._ is imported here to allow the use of $; col() or ' could be used instead.

Printing the df2 DataFrame gives the desired result:

+-------+-------------------------+
|user_id|is_following             |
+-------+-------------------------+
|1      |[2.0,3.0,4.0,5.0,6.0,7.0]|
|2      |[20.0,30.0,40.0,50.0]    |
+-------+-------------------------+

Schema:

root
|-- user_id: integer (nullable = false)
|-- is_following: vector (nullable = true)
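
As an alternative to the hand-written UDF, Spark 3.1 and later ship a built-in array_to_vector function in org.apache.spark.ml.functions. The sketch below assumes such a Spark version. The object name ArrayToVectorSketch, the helper toVector, and the local[*] master are illustrative choices, not part of the original answer. The array is cast to array<double> explicitly so the example does not rely on implicit type coercion.

```scala
import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ArrayToVectorSketch {
  // Hypothetical helper: replaces an array<int> column with a dense ml.linalg vector column.
  def toVector(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, array_to_vector(col(column).cast("array<double>")))

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch can run standalone
    val spark = SparkSession.builder.master("local[*]").appName("array-to-vector").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Array(2, 3, 4, 5, 6, 7)), (2, Array(20, 30, 40, 50)))
      .toDF("user_id", "is_following")

    val df2 = toVector(df, "is_following")
    df2.printSchema()
    df2.show(false)

    spark.stop()
  }
}
```

On older Spark versions this function is not available, and the UDF shown above remains the way to go.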

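So your initial input might be better suited than your transformed input. Spark's VectorAssembler expects each input column to be a scalar such as a Double (or an existing vector), not an array of doubles. Since different users can follow different numbers of people, your current structure could be fine; you just need to convert is_following to a Double. You could look at Spark's VectorIndexer for this (https://spark.apache.org/docs/2.1.0/ml-features.html#vectorindexer), or simply do the conversion manually in SQL.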
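
A minimal sketch of the manual conversion, assuming the ungrouped layout from the question's edit, where is_following is a plain integer column. The object name CastToDoubleSketch and the helper castToDouble are hypothetical names introduced for illustration.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object CastToDoubleSketch {
  // Hypothetical helper: returns a copy of `df` with the named column cast to double.
  def castToDouble(df: DataFrame, column: String): DataFrame =
    df.withColumn(column, col(column).cast("double"))
}
```

Calling CastToDoubleSketch.castToDouble(df, "is_following") on the ungrouped DataFrame leaves user_id untouched and turns is_following into a double column. The same cast can be written in Spark SQL as CAST(is_following AS DOUBLE).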

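So the tl;dr is: the type error occurs because Spark's Vector only supports Doubles (this may change for image data in the not-too-distant future, but that isn't a good fit for your use case anyway), and your alternative structure (the one without the grouping) may actually be better suited.

You may find the collaborative filtering example in the Spark documentation useful for your further adventures: https://spark.apache.org/docs/latest/ml-collaborative-filtering.html. Good luck and have fun with Spark ML :)

EDIT: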

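I noticed you said you want to run LDA on the input, so let's look at how to prepare the data for that format. For LDA input you may want to consider using CountVectorizer (see https://spark.apache.org/docs/2.1.0/ml-features.html#countvectorizer).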
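
A rough sketch of what that preparation could look like, assuming the grouped DataFrame from the question. CountVectorizer expects an array<string> column, so the followed ids are cast to strings and treated as tokens; whether that is a sensible "vocabulary" for your data is an assumption. The object name LdaInputSketch, the helper vectorize, the tokens column, and the k = 2 and maxIter = 10 settings are illustrative choices only.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LdaInputSketch {
  // Hypothetical helper: turns array<int> is_following into a sparse count vector column "features".
  def vectorize(df: DataFrame): (CountVectorizerModel, DataFrame) = {
    // CountVectorizer only accepts array<string>, so each followed id becomes a token.
    val tokens = df.withColumn("tokens", col("is_following").cast("array<string>"))
    val model = new CountVectorizer()
      .setInputCol("tokens")
      .setOutputCol("features")
      .fit(tokens)
    (model, model.transform(tokens))
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption so the sketch can run standalone
    val spark = SparkSession.builder.master("local[*]").appName("lda-input-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Array(2, 3, 4, 5, 6, 7)), (2, Array(20, 30, 40, 50)))
      .toDF("user_id", "is_following")

    val (_, features) = vectorize(df)

    // k and maxIter are arbitrary illustrative values, not tuned settings
    val ldaModel = new LDA().setK(2).setMaxIter(10).setFeaturesCol("features").fit(features)
    ldaModel.transform(features).select("user_id", "topicDistribution").show(false)

    spark.stop()
  }
}
```

If you start from the ungrouped layout instead, you would first need to collect each user's followed ids back into an array before this step.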
