TypeError: 'JavaPackage' object is not callable in PySpark with XGBoost



I am trying to make the Scala XGBoost API usable from my PySpark notebook, following this blog post: https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb However, I keep running into the following error:

spark._jvm.ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator
<py4j.java_gateway.JavaPackage at 0x7fa650fe7a58>
from sparkxgb import XGBoostEstimator
xgboost = XGBoostEstimator(
featuresCol="features", 
labelCol="Survival", 
predictionCol="prediction"
)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-1765fb9e3344> in <module>
4     featuresCol="features",
5     labelCol="Survival",
----> 6     predictionCol="prediction"
7 )
~/spark-assembly-2.4.0-twttr-kryo3-scala2128-hadoop2.9.2.t05/python/pyspark/__init__.py in wrapper(self, *args, **kwargs)
108             raise TypeError("Method %s forces keyword arguments." % func.__name__)
109         self._input_kwargs = kwargs
--> 110         return func(self, **kwargs)
111     return wrapper
112 
~/local/spark-3536cd7a-6188-4ca8-b3d0-57d42cd01531/userFiles-0a0d90bc-96b4-43f2-bf21-00ae0e6f7309/sparkxgb.zip/sparkxgb/xgboost.py in __init__(self, checkpoint_path, checkpointInterval, missing, nthread, nworkers, silent, use_external_memory, baseMarginCol, featuresCol, labelCol, predictionCol, weightCol, base_score, booster, eval_metric, num_class, num_round, objective, seed, alpha, colsample_bytree, colsample_bylevel, eta, gamma, grow_policy, max_bin, max_delta_step, max_depth, min_child_weight, reg_lambda, scale_pos_weight, sketch_eps, subsample, tree_method, normalize_type, rate_drop, sample_type, skip_drop, lambda_bias)
113 
114         super(XGBoostEstimator, self).__init__()
--> 115         self._java_obj = self._new_java_obj("ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator", self.uid)
116         self._create_params_from_java()
117         self._setDefault(
~/spark-assembly-2.4.0-twttr-kryo3-scala2128-hadoop2.9.2.t05/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
65             java_obj = getattr(java_obj, name)
66         java_args = [_py2java(sc, arg) for arg in args]
---> 67         return java_obj(*java_args)
68 
69     @staticmethod
TypeError: 'JavaPackage' object is not callable
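
A note on the first output above: py4j returns a JavaPackage placeholder for any dotted name it cannot resolve to a class on the JVM classpath, so getting JavaPackage back for XGBoostEstimator suggests the class was not found. The check can be sketched as below. The helper names are hypothetical, and the sketch compares type names instead of importing py4j so that it stays self-contained.

```python
def resolve_jvm_name(jvm, dotted_name):
    """Walk a dotted name attribute by attribute, starting from a py4j JVM view."""
    obj = jvm
    for part in dotted_name.split("."):
        obj = getattr(obj, part)
    return obj


def is_jvm_class(jvm, dotted_name):
    """Return True if the name resolved to a class, False if it fell back to a package."""
    # Assumption: resolved classes are instances of a type named 'JavaClass',
    # while unresolved names come back as 'JavaPackage' (as in the output above).
    return type(resolve_jvm_name(jvm, dotted_name)).__name__ == "JavaClass"
```

In the notebook this would be called as is_jvm_class(spark._jvm, "ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator"); a False result means the wrapper is asking for a class that the loaded jars do not provide.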

I searched for this error and tried the following. All of these ideas come from this GitHub issue: https://github.com/JohnSnowLabs/spark-nlp/issues/232

  1. Make sure xgboost4j is on SPARK_DIST_CLASSPATH. Checked:
$ echo $SPARK_DIST_CLASSPATH | tr " " "\n" | grep 'xgboost4j' | rev | cut -d'/' -f1 | rev
xgboost4j-0.72.jar
xgboost4j-spark.72.jar
  2. Make sure the jars are added to EXTRA_CLASSPATH. Done.
  3. Update the configuration:
'export PYSPARK_SUBMIT_ARGS="--conf spark.jars=$SPARK_HOME/jars/* --conf spark.driver.extraClassPath=$SPARK_HOME/jars/* --conf spark.executor.extraClassPath=$SPARK_HOME/jars/* pyspark-shell"',
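
The configuration step can also be done from inside the notebook, provided it runs before the first SparkSession or SparkContext is created. The sketch below names the jars explicitly with --jars rather than globbing $SPARK_HOME/jars/*. The helper name and the example paths are hypothetical, so substitute the jars that match your sparkxgb wrapper.

```python
import os


def build_submit_args(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value that passes the given jars to Spark."""
    # --jars takes a comma-separated list; the value must end with 'pyspark-shell'.
    return "--jars {} pyspark-shell".format(",".join(jar_paths))


# Hypothetical paths, for illustration only.
jars = ["/path/to/xgboost4j.jar", "/path/to/xgboost4j-spark.jar"]
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jars)
```

Listing the jars one by one makes it easier to see which xgboost4j version is actually being loaded.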

Environment:

  • Machine: Linux
  • Running in a Jupyter notebook
  • Spark version 2.4.0
  • Python 3.6

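I found the problem. The sparkxgb.zip wrapper that I downloaded from the internet was written for xgboost4j-0.72, but my jars come from xgboost4j-0.9, and the API changed completely between the two. As a result, version 0.9 has no class named ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator, which is why py4j resolves the name to a JavaPackage and the call fails. You can compare the two APIs here: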

https://github.com/dmlc/xgboost/tree/release_0.72/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark

https://github.com/dmlc/xgboost/tree/v0.90/jvm-packages/xgboost4j-spark/src/main/scala/ml/dmlc/xgboost4j/scala/spark
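
One way to confirm this kind of mismatch locally is to look inside the jar itself, since a jar is a zip archive and each class is stored as a .class entry under its package path. This is a minimal sketch that assumes the jar is readable from the notebook; the function name is hypothetical.

```python
import zipfile


def jar_has_class(jar_path, class_name):
    """Return True if the jar holds a .class entry for the fully qualified class name."""
    # 'a.b.C' is stored in the archive as 'a/b/C.class'.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()
```

Calling it on the xgboost4j-spark jar with "ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator" shows whether the class that sparkxgb expects is present in that jar.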
