PySpark, Decision Trees (Spark 2.0.0)

I am new to Spark (using PySpark). I tried to run the Decision Tree tutorial from here (link). I execute the code:

from pyspark.ml import Pipeline
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.feature import StringIndexer, VectorIndexer
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.mllib.util import MLUtils
# Load and parse the data file, converting it to a DataFrame.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
labelIndexer = StringIndexer(inputCol="label", outputCol="indexedLabel").fit(data)
# Now this line fails
featureIndexer = VectorIndexer(
    inputCol="features", outputCol="indexedFeatures", maxCategories=4).fit(data)

I get the error message:

IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.'

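When searching online for this error, I found an answer that says:

Use
from pyspark.ml.linalg import Vectors, VectorUDT
instead of
from pyspark.mllib.linalg import Vectors, VectorUDT

This is odd, since I have not used it anywhere. Moreover, adding this import to my code solves nothing; I still get the same error.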

I am not sure how to debug this situation. When I look at the raw data, I see:

data.show()
+--------------------+-----+
|            features|label|
+--------------------+-----+
|(692,[127,128,129...|  0.0|
|(692,[158,159,160...|  1.0|
|(692,[124,125,126...|  1.0|
|(692,[152,153,154...|  1.0|

This looks like a list that starts with "(".

I do not know how to solve this problem, or even how to debug it.

The root of the problem seems to be that you are running a Spark 1.5.2 example on Spark 2.0.0 (see the reference to the Spark 2.0 example below).

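Difference between spark.ml and spark.mllib

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the DataFrame-based API in the spark.ml package.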

For more details, see: http://spark.apache.org/docs/latest/ml-guide.html

With Spark 2.0, try the Spark 2.0.0 example (https://spark.apache.org/docs/2.0.0/mllib-decision-tree.html):

from pyspark.mllib.tree import DecisionTree, DecisionTreeModel
from pyspark.mllib.util import MLUtils
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
# Train a DecisionTree model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
model = DecisionTree.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     impurity='gini', maxDepth=5, maxBins=32)
# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
# Index into the (label, prediction) pair; tuple-unpacking lambdas are Python 2 only.
testErr = labelsAndPredictions.filter(lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification tree model:')
print(model.toDebugString())
# Save and load model
model.save(sc, "target/tmp/myDecisionTreeClassificationModel")
sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

Find the full example code at "examples/src/main/python/mllib/decision_tree_classification_example.py" in the Spark repo.
