How to get the number of features after preprocessing to use the neural network classifier in pyspark.ml

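I am trying to build a neural network with pyspark.ml. The problem is that I use OneHotEncoder and other preprocessing steps to transform the categorical variables. The stages in my pipeline are:

1. Index the categorical features
2. Apply OneHotEncoder
3. Apply VectorAssembler
4. Apply PCA
5. Feed "pcaFeatures" to the neural network classifier

The problem is that I don't know the number of features after step 4, so I can't pass it to the classifier's "layers" parameter in step 5. My question is: how do I get the final number of features? Here is my code; I have left out the imports and the data loading.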

stages = []
for c in Categories:
    stringIndexer = StringIndexer(inputCol= c , outputCol=c + "_indexed")
    encoder = OneHotEncoder(inputCol= c + "_indexed", outputCol=c + "_categoryVec")
    stages += [stringIndexer, encoder]
labelIndexer = StringIndexer(inputCol="Target", outputCol="indexedLabel")
final_features = list(map(lambda c: c+"_categoryVec", Categories))+Continuous

assembler = VectorAssembler(
    inputCols= final_features,
    outputCol="features")
pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")
(train_val, test_val) = train.randomSplit([0.95, 0.05])
num_classes= train.select("Target").distinct().count()
NN= MultilayerPerceptronClassifier(labelCol="indexedLabel", featuresCol='pcaFeatures', maxIter=100,
                                    layers=[????, 5, 5, num_classes], blockSize=10, seed=1234)

stages += [labelIndexer]
stages += [assembler]
stages += [pca]
stages += [NN]
pipeline = Pipeline(stages=stages)
model = pipeline.fit(train_val)

According to the documentation, the input parameter k is the number of principal components.

So in your case:

pca = PCA(k=20, inputCol="features", outputCol="pcaFeatures")

the number of features is 20, so the first entry of "layers" should be 20.

Update

Another approach is to look at the length of one of the assembled vectors.

For example, if you want the length after step 3:

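from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType
# `assembled` is assumed to be a DataFrame that already has the 'features' column (the assembler's output)
nfeatures = (assembled.withColumn('len', udf(len, IntegerType())(col('features')))
    .select('len').take(1)[0]['len'])  # take(1) returns a one-element list of Rows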

I feel there should be a better way to do this, i.e. one that does not require calling take().
