Is it possible to load a trained pipeline model from a PySpark environment into Scala? I am trying to do so, but I get this error:
requirement failed: Error loading metadata: Expected class name org.apache.spark.ml.PipelineModel but found class name pyspark.ml.pipeline.PipelineModel
More precisely, I have a PySpark pipeline model:
pipe = Pipeline(stages=[transformer_1, transformer_2, RandomForestClassifier()])
pipe_model = pipe.fit(data)
pipe_model.save("model.model")
When I try to load this model in Scala:
val saved_pipeline_model = PipelineModel.load("model.model")
I get the error above. Looking into org.apache.spark.ml.PipelineModel, I found that the error comes from the load function:
def load(
expectedClassName: String,
sc: SparkContext,
path: String): (String, Array[PipelineStage]) = instrumented { instr =>
val metadata = DefaultParamsReader.loadMetadata(path, sc, expectedClassName)
/**
* Load metadata saved using [[DefaultParamsWriter.saveMetadata()]]
*
* @param expectedClassName If non empty, this is checked against the loaded metadata.
* @throws IllegalArgumentException if expectedClassName is specified and does not match metadata
*/
def loadMetadata(path: String, sc: SparkContext, expectedClassName: String = ""): Metadata = {
val metadataPath = new Path(path, "metadata").toString
val metadataStr = sc.textFile(metadataPath, 1).first()
parseMetadata(metadataStr, expectedClassName)
}
Indeed, loadMetadata checks that expectedClassName matches the class name stored in the metadata folder.
I solved this by editing the metadata myself:
{"class":"pyspark.ml.pipeline.PipelineModel","timestamp":1635178267710,..}
becomes
{"class":"org.apache.spark.ml.PipelineModel","timestamp":1635178267710,..}