Random forest class_weight and sample_weight parameters

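I have a class imbalance problem and have been experimenting with a weighted random forest using the implementation in scikit-learn (>= 0.16).

I noticed that the implementation takes a class_weight parameter in the tree constructor and a sample_weight parameter in the fit method to help with class imbalance. The two appear to be multiplied together to determine the final weight.

I am having trouble understanding the following:

At which stages of tree construction/training/prediction are these weights used? I have seen some papers on weighted trees, but I am not sure what scikit implements.
What exactly is the difference between class_weight and sample_weight?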

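RandomForest is built on trees, and the trees are well documented. Check how trees use the sample weights:

Decision tree user guide - tells you exactly which algorithm is used
Decision tree API - explains how trees use sample_weight (which, for random forests, as you have already determined, is the product of class_weight and sample_weight).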
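To make the "at which stage are the weights used" part a bit more concrete: my understanding is that in scikit-learn's trees the sample weights enter the impurity computation, so class proportions at a node come from summed weights rather than raw counts. Here is a minimal sketch of that idea for Gini impurity. Note that weighted_gini is a hypothetical helper written for this illustration (it is not a scikit-learn function), and the toy data are made up.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def weighted_gini(y, w):
    """Gini impurity from weighted class totals (illustrative helper, not sklearn API)."""
    y = np.asarray(y)
    w = np.asarray(w, dtype=float)
    # Sum the weights of the samples in each class, then normalise to proportions.
    totals = np.array([w[y == c].sum() for c in np.unique(y)])
    p = totals / totals.sum()
    return 1.0 - np.sum(p ** 2)


# Toy data (assumed for illustration): three majority samples, one minority sample.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 0, 1])
w = np.array([1.0, 1.0, 1.0, 3.0])  # up-weight the single minority sample

unweighted = weighted_gini(y, np.ones(len(y)))
weighted = weighted_gini(y, w)

# Fit a tree with the same weights and read the impurity it stored for the root node.
tree = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)
root_impurity = tree.tree_.impurity[0]

print(unweighted, weighted, root_impurity)
```

With the minority sample up-weighted, the root node looks balanced to the splitter, which is why weighting changes which splits get chosen.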

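As for the difference between class_weight and sample_weight: much of it follows simply from their data types. sample_weight is a 1D array of length n_samples that assigns an explicit weight to each example used for training. class_weight is either a dictionary mapping each class to a uniform weight for that class (e.g., {1: .9, 2: .5, 3: .01}), or a string telling sklearn how to determine this dictionary automatically.

So the training weight for a given example is the product of its explicitly given sample_weight (or 1 if sample_weight is not provided) and the class_weight of its class (or 1 if class_weight is not provided).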
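A small sketch of that product, using compute_sample_weight to expand the class_weight dictionary into one weight per sample. The labels and the sample_weight values below are made up for illustration; the dictionary is the one from the example above.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Made-up labels and per-sample weights (assumptions for this illustration).
y = np.array([1, 1, 2, 3])
sample_weight = np.array([1.0, 2.0, 1.0, 4.0])
class_weight = {1: 0.9, 2: 0.5, 3: 0.01}

# Expand the per-class dictionary into a per-sample array aligned with y.
expanded_class_weight = compute_sample_weight(class_weight, y)

# The effective training weight of each sample is the elementwise product.
final_weight = sample_weight * expanded_class_weight
print(expanded_class_weight)
print(final_weight)
```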

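If we look at the source code, RandomForestClassifier is subclassed from ForestClassifier, which in turn is subclassed from BaseForest, and the fit() method is actually defined on BaseForest. As the OP pointed out, the interaction between class_weight and sample_weight determines the sample weights used to fit each decision tree of the random forest.

If we inspect the _validate_y_class_weight(), fit() and _parallel_build_trees() methods, we can better understand the interaction between the class_weight, sample_weight and bootstrap parameters. In particular:

If class_weight is passed to the RandomForestClassifier() constructor but no sample_weight is passed to fit(), class_weight is used as the sample weight
If both sample_weight and class_weight are passed, they are multiplied together to determine the final sample weights used to train each decision tree
If class_weight=None, then sample_weight determines the final sample weights (by default, if it is None, all samples are weighted equally).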
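The first two cases can be checked empirically. The sketch below fits forests both ways and compares their predicted probabilities. The dataset, the class_weight dictionary and the random per-sample weights are all made up, and bootstrap=False is an assumption chosen here only to keep resampling out of the comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic imbalanced data (assumed for illustration).
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)
cw = {0: 1.0, 1: 5.0}
sw = np.random.RandomState(1).uniform(0.5, 2.0, size=len(y))
expanded = compute_sample_weight(cw, y)

params = dict(n_estimators=20, bootstrap=False, random_state=0)

# Case 1: class_weight only vs. the same weights passed as sample_weight.
rf_cw = RandomForestClassifier(class_weight=cw, **params).fit(X, y)
rf_sw = RandomForestClassifier(**params).fit(X, y, sample_weight=expanded)
case1_same = np.allclose(rf_cw.predict_proba(X), rf_sw.predict_proba(X))

# Case 2: both given vs. their product passed as sample_weight.
rf_both = RandomForestClassifier(class_weight=cw, **params).fit(X, y, sample_weight=sw)
rf_prod = RandomForestClassifier(**params).fit(X, y, sample_weight=sw * expanded)
case2_same = np.allclose(rf_both.predict_proba(X), rf_prod.predict_proba(X))

print(case1_same, case2_same)
```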

The relevant part of the source code can be summarized as follows.

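from sklearn.utils import compute_sample_weight

# "balanced_subsample" without bootstrapping falls back to "balanced" on the full y
if class_weight == "balanced_subsample" and not bootstrap:
    expanded_class_weight = compute_sample_weight("balanced", y)
# any other non-None class_weight is expanded up front, whether or not bootstrap is set
elif class_weight is not None and class_weight != "balanced_subsample":
    expanded_class_weight = compute_sample_weight(class_weight, y)
# class_weight is None, or "balanced_subsample" with bootstrap (handled per tree)
else:
    expanded_class_weight = None

# fold the expanded class weights into the user-supplied sample weights
if expanded_class_weight is not None:
    if sample_weight is not None:
        sample_weight = sample_weight * expanded_class_weight
    else:
        sample_weight = expanded_class_weight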
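For reference, here is a small sketch of what compute_sample_weight("balanced", y) returns. As I understand the documented formula, each class gets n_samples / (n_classes * count of that class), and the result is expanded to one value per sample. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Made-up imbalanced labels: three samples of class 0, one of class 1.
y = np.array([0, 0, 0, 1])

balanced = compute_sample_weight("balanced", y)

# Manual version of the "balanced" heuristic: n_samples / (n_classes * class_count).
n_classes = len(np.unique(y))
per_class = len(y) / (n_classes * np.bincount(y))
manual = per_class[y]  # look up each sample's class weight

print(balanced)
print(manual)
```

The rarer class gets the larger weight, so both classes end up with the same total weight.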

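With bootstrap=True, observations are randomly selected for each individual tree. This selection is implemented through the sample_weight argument passed to the tree's fit, as the relevant (abridged) code below shows.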

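import numpy as np
from sklearn.utils import check_random_state

if bootstrap:
    if sample_weight is None:
        sample_weight = np.ones((X.shape[0],), dtype=np.float64)
    # draw n_samples_bootstrap row indices with replacement from [0, n_samples)
    indices = check_random_state(tree.random_state).randint(0, X.shape[0], n_samples_bootstrap)
    # how many times each row was drawn; rows never drawn end up with weight 0
    sample_counts = np.bincount(indices, minlength=X.shape[0])
    sample_weight *= sample_counts
    # "balanced_subsample" recomputes balanced class weights on the bootstrap sample only
    if class_weight == "balanced_subsample":
        sample_weight *= compute_sample_weight("balanced", y, indices=indices)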
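The same steps can be run standalone, outside the forest, to see what one tree's weights look like. This is only a sketch under assumptions: the labels are made up, the seed is arbitrary, and n_samples_bootstrap is simply set to the number of samples.

```python
import numpy as np
from sklearn.utils import check_random_state
from sklearn.utils.class_weight import compute_sample_weight

# Made-up labels (assumption): six samples of class 0, two of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
n_samples = y.shape[0]
n_samples_bootstrap = n_samples  # assumed: draw as many rows as there are samples

# Draw row indices with replacement, as one tree of the forest would.
rng = check_random_state(0)
indices = rng.randint(0, n_samples, n_samples_bootstrap)
sample_counts = np.bincount(indices, minlength=n_samples)

# Starting from uniform weights, the bootstrap is expressed purely through weights.
tree_weight = np.ones(n_samples, dtype=np.float64) * sample_counts

# "balanced_subsample": class weights are computed from y[indices], not the full y.
tree_weight_balanced = tree_weight * compute_sample_weight(
    "balanced", y, indices=indices
)

print(sample_counts)
print(tree_weight_balanced)
```

Rows that were not drawn keep a weight of 0, so the tree effectively never sees them even though the full X and y are passed to it.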
