How to make the RMSE (root mean squared error) small when using Spark ALS

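I need some suggestions for building a good recommendation model with Spark's Collaborative Filtering. There is sample code on the official website, which I have also pasted below: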

from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating
# Load and parse the data (sc is an existing SparkContext)
data = sc.textFile("data/mllib/als/test.data")
ratings = data.map(lambda l: l.split(','))\
    .map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))
# Build the recommendation model using Alternating Least Squares
rank = 10
numIterations = 10
model = ALS.train(ratings, rank, numIterations)
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# Mean of the squared errors, then the square root
RMSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean() ** 0.5
print("Root Mean Squared Error = " + str(RMSE))

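A good model needs the RMSE to be as small as possible.

Is my RMSE large because I have not set suitable parameters for the ALS.train method, such as rank, numIterations, and so on?
Or is it large because my dataset is small?

So can anyone help me figure out what causes a large RMSE and how to fix it?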

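Addition:

As @eliasah said, I need to add some details to narrow down the set of answers. Let's consider this particular case:

Suppose I want to build a recommender system that suggests music to my customers. I have their historical ratings for tracks, albums, artists, and genres. Obviously, these four classes form a hierarchy: a track belongs directly to an album, an album belongs directly to an artist, and an artist may belong to several different genres. In the end, I want to use all of this information to choose some tracks to recommend to customers.

So what is the best practice for building a good model in a case like this, and for keeping the RMSE of the predictions as small as possible?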

As mentioned above, given the same dataset, the RMSE decreases as rank and numIterations increase. However, as the dataset grows, the RMSE increases as well.

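Now, one practice for reducing the RMSE and some other similar measures is to normalize the rating values. In my experience, this works really well when you know the minimum and maximum rating values in advance.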
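As a minimal sketch of that normalization, assuming a known 1-to-5 rating scale (the scale and the helper names `normalize_rating` and `denormalize_rating` are illustrative and not part of Spark), min-max scaling could look like this:

```python
def normalize_rating(value, min_rating, max_rating):
    """Map a rating from [min_rating, max_rating] onto [0, 1]."""
    if max_rating <= min_rating:
        raise ValueError("max_rating must be greater than min_rating")
    return (value - min_rating) / float(max_rating - min_rating)


def denormalize_rating(scaled, min_rating, max_rating):
    """Map a value in [0, 1] back onto the original rating scale."""
    return scaled * (max_rating - min_rating) + min_rating


# Assumed rating bounds, for illustration only.
MIN_RATING, MAX_RATING = 1.0, 5.0
scaled = [normalize_rating(r, MIN_RATING, MAX_RATING) for r in [1.0, 3.0, 5.0]]
```

On an RDD of Rating objects, the same helper could be applied with something like `ratings.map(lambda r: Rating(r[0], r[1], normalize_rating(r[2], MIN_RATING, MAX_RATING)))`. Keep in mind that an RMSE computed on scaled ratings lives on the scaled range, so it should only be compared with other errors measured on that same range.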

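Also, you should consider using measures other than the RMSE. When doing matrix factorization, what I have found useful is to compute the Frobenius norm of (ratings - predictions) and then divide it by the Frobenius norm of the ratings. By doing this, you get the relative error of your predictions with respect to the original ratings.

Here is the Spark code for this approach: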

from math import sqrt
# Evaluate the model on training data
testdata = ratings.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
# Frobenius norm of the error (actual - predicted)
abs_frobenius_error = sqrt(ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).sum())
# Frobenius norm of the original ratings
frob_error_orig = sqrt(ratings.map(lambda r: r[2]**2).sum())
# finally, the relative error
rel_error = abs_frobenius_error/frob_error_orig
print("Relative Error = " + str(rel_error))

With this error measure, the closer the error is to zero, the better your model is.
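To check the arithmetic without a cluster, here is a small plain-Python sketch of the same relative error. The function name `relative_frobenius_error` and the (actual, predicted) pairs are made up for illustration:

```python
from math import sqrt


def relative_frobenius_error(pairs):
    """pairs: iterable of (actual, predicted) rating tuples."""
    pairs = list(pairs)
    # Frobenius norm of the error (actual - predicted)
    abs_error = sqrt(sum((actual - predicted) ** 2 for actual, predicted in pairs))
    # Frobenius norm of the actual ratings
    norm_actual = sqrt(sum(actual ** 2 for actual, _ in pairs))
    if norm_actual == 0:
        raise ValueError("relative error is undefined when all ratings are zero")
    return abs_error / norm_actual


# Made-up (actual, predicted) pairs, for illustration only.
example_pairs = [(3.0, 3.0), (4.0, 2.0), (0.0, 1.0)]
rel_error = relative_frobenius_error(example_pairs)
```

Here the error norm is sqrt(0 + 4 + 1) and the rating norm is sqrt(9 + 16 + 0) = 5, so the relative error is sqrt(5) / 5. Perfect predictions would give exactly 0.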

I hope this helps.

I did some research on this, and here are my conclusions:

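When the rank and the number of iterations grow, the RMSE decreases. However, when the size of the dataset grows, the RMSE increases. From my results, the rank size changes the RMSE value more significantly.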
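One way to explore these effects systematically is a small grid search over rank, iterations, and the regularization parameter, scored on held-out data. The sketch below is plain Python and makes no Spark calls itself; `grid_search`, `train_fn`, and `rmse_fn` are hypothetical names introduced here for illustration:

```python
from itertools import product


def grid_search(train_fn, rmse_fn, ranks, iteration_counts, lambdas):
    """Return (best_rmse, best_params, all_results) over the parameter grid.

    train_fn(rank, iterations, lmbda) -> model          (hypothetical callable)
    rmse_fn(model) -> float, measured on held-out data  (hypothetical callable)
    """
    results = []
    for rank, iterations, lmbda in product(ranks, iteration_counts, lambdas):
        model = train_fn(rank, iterations, lmbda)
        results.append((rmse_fn(model), (rank, iterations, lmbda)))
    # Keep the combination with the lowest held-out RMSE.
    best_rmse, best_params = min(results, key=lambda item: item[0])
    return best_rmse, best_params, results
```

With Spark, `train_fn` could wrap a call such as `ALS.train(training, rank, iterations, lmbda)` and `rmse_fn` could wrap an RMSE helper evaluated on a validation RDD, assuming separate training and validation sets exist. Scoring on held-out data rather than on the training set matters here, because a higher rank can always fit the training ratings more closely.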

I know this is still not enough to get a good model. I would welcome more ideas!

In pyspark, use this to compute the root mean squared error (RMSE):

from pyspark.mllib.recommendation import ALS
from math import sqrt
from operator import add

# training and validation are assumed to be existing RDDs of ratings,
# and num_validation the number of ratings in validation.
# rank is the number of latent factors in the model.
# iterations is the number of iterations to run.
# lambda specifies the regularization parameter in ALS
rank = 8
num_iterations = 8
lmbda = 0.1
# Train model with training data and configured rank and iterations
model = ALS.train(training, rank, num_iterations, lmbda)

def compute_rmse(model, data, n):
    """
    Compute RMSE (Root Mean Squared Error), or square root of the average value
        of (actual rating - predicted rating)^2
    """
    predictions = model.predictAll(data.map(lambda x: (x[0], x[1])))
    predictions_ratings = predictions.map(lambda x: ((x[0], x[1]), x[2])) \
      .join(data.map(lambda x: ((x[0], x[1]), x[2]))) \
      .values()
    return sqrt(predictions_ratings.map(lambda x: (x[0] - x[1]) ** 2).reduce(add) / float(n))

print("The model was trained with rank = %d, lambda = %.1f, and %d iterations.\n" %
      (rank, lmbda, num_iterations))
# Print RMSE of model
validation_rmse = compute_rmse(model, validation, num_validation)
print("Its RMSE on the validation set is %f.\n" % validation_rmse)
