Interaction between sample_weight and min_samples_split in decision trees



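In sklearn.ensemble.RandomForestClassifier, if we set both sample_weight and min_samples_split, do the sample weights affect min_samples_split? For example, if min_samples_split = 20 and every data point in the sample has weight 2, do 10 data points satisfy the min_samples_split condition?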

No, see the source; min_samples_split does not take sample weights into account. Compare min_samples_leaf with its weighted counterpart min_weight_fraction_leaf (source).

Your example suggests a simple experiment to check:

from sklearn.tree import DecisionTreeClassifier
import numpy as np
X = np.array([1, 2, 3]).reshape(-1, 1)
y = [0, 0, 1]
tree = DecisionTreeClassifier()
tree.fit(X, y)
print(len(tree.tree_.feature))  # number of nodes
# 3
tree.set_params(min_samples_split=10)
tree.fit(X, y)
print(len(tree.tree_.feature))
# 1
tree.fit(X, y, sample_weight=[20, 20, 20])  # min_samples_split is still 10
print(len(tree.tree_.feature))
# 1; sample weights are not counted, so heavily weighted
#    samples still fail the min_samples_split threshold
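
The exact scenario from the question (10 points, each with weight 2, min_samples_split = 20) can be checked the same way. The sketch below also contrasts it with min_weight_fraction_leaf, which is computed from the weights. The helper n_nodes and the toy data are made up for illustration and are not part of scikit-learn; it uses DecisionTreeClassifier rather than the forest, on the assumption that the forest passes these parameters through to its trees.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def n_nodes(X, y, sample_weight=None, **params):
    """Hypothetical helper: fit a fresh tree and return its node count."""
    tree = DecisionTreeClassifier(random_state=0, **params)
    tree.fit(X, y, sample_weight=sample_weight)
    return tree.tree_.node_count


# Scenario from the question: 10 points, every point has weight 2.
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 5 + [1] * 5)
w = np.full(10, 2.0)

# The root holds 10 samples (total weight 20). If weights counted,
# min_samples_split=20 would allow a split; it does not.
split_20 = n_nodes(X, y, w, min_samples_split=20)
# With the threshold at the raw sample count, the root is split.
split_10 = n_nodes(X, y, w, min_samples_split=10)

# Contrast: min_weight_fraction_leaf is a fraction of the total weight.
X3 = np.array([1, 2, 3]).reshape(-1, 1)
y3 = np.array([0, 0, 1])
# Unweighted: each leaf needs weight >= 0.4 * 3 = 1.2, so no split
# can isolate a single sample.
frac_unweighted = n_nodes(X3, y3, min_weight_fraction_leaf=0.4)
# Weighted: each leaf needs >= 0.4 * 7 = 2.8; splitting off the last
# sample gives leaves of weight 4 and 3, which is allowed.
frac_weighted = n_nodes(X3, y3, np.array([2.0, 2.0, 3.0]),
                        min_weight_fraction_leaf=0.4)

print(split_20, split_10, frac_unweighted, frac_weighted)
```

So the 10 weighted points do not satisfy min_samples_split = 20. If the stopping rule should depend on the weights, min_weight_fraction_leaf is the parameter that responds to them.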
