Importing a trained pipeline model from pyspark into scala?



Is it possible to load a pipeline model trained in pyspark into a scala environment? I am trying to do so, but I get this error:

requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name pyspark.ml.pipeline.PipelineModel

More precisely, I have a pyspark pipeline model:

pipe = Pipeline(stages=[transformer_1, transformer_2, RandomForestClassifier()])
pipe_model = pipe.fit(data)
pipe_model.save("model.model")

When I try to load this model in scala:

val saved_pipeline_model = PipelineModel.load("model.model")

I get the error above. Looking into org.apache.spark.ml.PipelineModel, I found that the error comes from the load function:

def load(
    expectedClassName: String,
    sc: SparkContext,
    path: String): (String, Array[PipelineStage]) = instrumented { instr =>
  val metadata = DefaultParamsReader.loadMetadata(path, sc, expectedClassName)

/**
 * Load metadata saved using [[DefaultParamsWriter.saveMetadata()]]
 *
 * @param expectedClassName  If non empty, this is checked against the loaded metadata.
 * @throws IllegalArgumentException if expectedClassName is specified and does not match metadata
 */
def loadMetadata(path: String, sc: SparkContext, expectedClassName: String = ""): Metadata = {
  val metadataPath = new Path(path, "metadata").toString
  val metadataStr = sc.textFile(metadataPath, 1).first()
  parseMetadata(metadataStr, expectedClassName)
}

Indeed, loadMetadata checks that expectedClassName matches the class name recorded in the metadata folder.
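That check can be mimicked in a short Python sketch (the function name check_class_name is hypothetical; the real logic lives in Scala, inside DefaultParamsReader.parseMetadata):

```python
def check_class_name(metadata_class: str, expected: str) -> None:
    """Mimic the requirement check in DefaultParamsReader.parseMetadata:
    a non-empty expected name must match the recorded class exactly."""
    if expected and metadata_class != expected:
        raise ValueError(
            "requirement failed: Error loading metadata: "
            f"Expected class name {expected} but found class name {metadata_class}"
        )

# The pyspark-saved metadata records the Python class, so the Scala loader fails:
# check_class_name("pyspark.ml.pipeline.PipelineModel",
#                  "org.apache.spark.ml.PipelineModel")  # raises ValueError
```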

I solved this by editing the metadata myself, replacing the recorded class name:

{"class":"pyspark.ml.pipeline.PipelineModel","timestamp":1635178267710,..}

becomes

{"class":"org.apache.spark.ml.PipelineModel","timestamp":1635178267710,..}
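Rather than hand-editing the file, the same substitution can be scripted. A minimal sketch, assuming the metadata is the single JSON line that Spark writes under metadata/ (patch_metadata_line is a hypothetical helper, not part of Spark):

```python
import json

def patch_metadata_line(line: str) -> str:
    """Rewrite the 'class' field of one metadata JSON line so that
    Scala's loadMetadata class-name check passes."""
    meta = json.loads(line)
    if meta.get("class") == "pyspark.ml.pipeline.PipelineModel":
        meta["class"] = "org.apache.spark.ml.PipelineModel"
    return json.dumps(meta)

# Example on a metadata line like the one above:
original = '{"class":"pyspark.ml.pipeline.PipelineModel","timestamp":1635178267710}'
patched = patch_metadata_line(original)
```

After patching, the file must be written back to the metadata/ directory before calling PipelineModel.load from scala.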
