TypeError: data should be an RDD of LabeledPoint, but got <type 'numpy.ndarray'>



I get the error:

TypeError: data should be an RDD of LabeledPoint, but got <type 'numpy.ndarray'>

when executing:

import sys
import numpy as np
from pyspark import SparkConf, SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD

conf = (SparkConf().setMaster("local")
.setAppName("Logistic Regression")
.set("spark.executor.memory", "1g"))
sc = SparkContext(conf = conf) 

def mapper(line):
    feats = line.strip().split(",") 
    label = feats[len(feats) - 1]       # Last column is the label
    feats = feats[2: len(feats) - 1]    # remove id and type column
    feats.insert(0,label)
    features = [ float(feature) for feature in feats ] # need floats
    return np.array(features)
data = sc.textFile("test.csv")
parsedData = data.map(mapper)
# Train model
model = LogisticRegressionWithSGD.train(parsedData)

I get the error on the `model = LogisticRegressionWithSGD.train(parsedData)` line.

parsedData should be an RDD. I'm not sure why I'm getting this.

GitHub link to the full source code


The problem is not that parsedData isn't an RDD, but what it stores. As the message says, you are passing an RDD[numpy.ndarray] where an RDD[LabeledPoint] is required:

from pyspark.mllib.regression import LabeledPoint
def mapper(line):
    ...
    return LabeledPoint(label, features)
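
Putting the pieces together, here is a minimal sketch of the corrected parsing logic (assuming, as in the question, comma-separated lines where the first two columns are id/type to drop and the last column is the label). In the real Spark job the mapper would return `LabeledPoint(label, features)`; the sketch below returns a plain `(label, features)` tuple so the parsing can be checked without a Spark cluster:

```python
def parse_line(line):
    # Hypothetical helper mirroring the question's mapper.
    parts = line.strip().split(",")
    label = float(parts[-1])                     # last column is the label
    features = [float(x) for x in parts[2:-1]]   # drop id, type, and label
    # In the Spark job, wrap this in
    # pyspark.mllib.regression.LabeledPoint(label, features)
    return label, features

print(parse_line("1,A,0.5,1.5,2.5,0"))  # → (0.0, [0.5, 1.5, 2.5])
```

Note that the label is no longer inserted at position 0 of the feature list, as the original mapper did; LabeledPoint keeps the label and the feature vector separate.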
