IllegalArgumentException: Column must be of type struct<type:tinyint,size:int,indices:array<int>,values:arr

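I have a DataFrame with several categorical columns. I am trying to compute the chi-squared statistic between two of the columns using the built-in function: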

from pyspark.ml.stat import ChiSquareTest
r = ChiSquareTest.test(df, 'feature1', 'feature2')

However, it gives me an error:

IllegalArgumentException: 'requirement failed: Column feature1 must be of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>> but was actually double.'

The data type of feature1 is:

feature1: double (nullable = true)

Can you help me with this?

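spark-ml is not a typical statistics library. It is very ML-oriented, so it assumes that you want to run the test between a label and a feature or a set of features.

Therefore, just as when training a model, you need to assemble the features you want to test against the label into a vector column.

In your case, you can assemble feature1 as follows: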

from pyspark.ml.stat import ChiSquareTest
from pyspark.ml.feature import VectorAssembler
# Assumes an active SparkSession named `spark`, as in the pyspark shell.
data = [(1, 2), (3, 4), (2, 1), (4, 3)]
df = spark.createDataFrame(data, ['feature1', 'feature2'])
# Wrap feature1 in a vector column; feature2 plays the role of the label.
assembler = VectorAssembler().setInputCols(['feature1']).setOutputCol('features')
ChiSquareTest.test(assembler.transform(df), 'features', 'feature2').show(truncate=False)

Just in case, here is the code in Scala:

import org.apache.spark.ml.stat.ChiSquareTest
import org.apache.spark.ml.feature.VectorAssembler
// In spark-shell, spark.implicits._ (needed for toDF) is already imported.
val df = Seq((1, 2, 3), (1, 2, 3), (4, 5, 6), (6, 5, 4))
  .toDF("feature1", "feature2", "feature3")
// Wrap the single feature column in a vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1"))
  .setOutputCol("features")
ChiSquareTest.test(assembler.transform(df), "features", "feature2").show(false)

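To expand on Oli's answer, Spark ML expects features to be stored in instances of pyspark.ml.linalg.Vector. There are two kinds of vectors:

Dense vectors - these are simply arrays that contain every element of the vector, including all zeros, and are represented by a Spark array of type array<T>
Sparse vectors - these are more complex data structures that store only the non-zero elements of the vector, which allows huge vectors with only a few non-zero entries to be stored compactly. A sparse vector has three components:
- an integer size, the full dimension of the vector
- an indices array that holds the positions of the non-zero elements
- a values array that holds the values of the non-zero elements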

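Both vector types are actually represented using the structure for sparse vectors; for a dense vector, the indices array goes unused and values stores all the elements. The first struct element, type, is used to distinguish the two kinds.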
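The shared layout can be sketched in plain Python, without Spark. The helper names (encode_dense, encode_sparse, decode) are hypothetical, and the tag values 0 for sparse and 1 for dense are an assumption about Spark's internal encoding rather than something stated above; the point is only to show how one four-field struct can carry both kinds of vector.

```python
# Illustrative only: plain tuples mimicking the struct fields
# (type, size, indices, values). Tag values are assumed: 0 = sparse, 1 = dense.

def encode_dense(elements):
    """Dense: size and indices are unused, values holds every element."""
    return (1, None, None, [float(v) for v in elements])

def encode_sparse(elements):
    """Sparse: keep only the non-zero elements and their positions."""
    indices = [i for i, v in enumerate(elements) if v != 0]
    return (0, len(elements), indices, [float(elements[i]) for i in indices])

def decode(struct):
    """Rebuild the full list of elements from either encoding."""
    kind, size, indices, values = struct
    if kind == 1:
        return list(values)
    full = [0.0] * size
    for i, v in zip(indices, values):
        full[i] = v
    return full

vec = [0, 0, 3, 0, 5]
dense = encode_dense(vec)
sparse = encode_sparse(vec)
print(dense)
print(sparse)
```

Both encodings decode back to the same five elements; the sparse one simply stores two positions and two values instead of all five entries.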

So if you see an error saying that something expects struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, it means you should pass instances of pyspark.ml.linalg.Vector rather than plain numbers.

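To produce Vectors, you can either use pyspark.ml.feature.VectorAssembler to assemble one or more independent feature columns into a single vector column, or build them manually with the factory methods Vectors.dense() (for dense vectors) and Vectors.sparse() (for sparse vectors) of the factory object pyspark.ml.linalg.Vectors. Since VectorAssembler is implemented in Scala, using it is probably easier and also faster. For explicit vector creation, see the ChiSquareTest example in the PySpark documentation.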
