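Loop to find the maximum value of R2 in Python

I am trying to build a decision tree while optimizing which sampled values are used.

I am working with a set of values such as: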

DATA1  DATA2  DATA3  VALUE
100    300    400    1.6
102    298    405    1.5
88     275    369    ?
120    324    417    0.9
103    297    404    1.7
110    310    423    1.1
105    297    401    0.7
99     309    397    1.6
...

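My task is to build a decision tree so that the value to predict can be predicted from DATA1, DATA2 and DATA3.

I started with a classification forest, which gave me a coefficient of determination. I attach it below: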

from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Data (dfs is the DataFrame holding the values shown above)
X = dfs.drop(columns='Dato a predecir')
y = dfs['Dato a predecir']
# 70% of the dataset for training and 30% for validation
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    train_size=0.7,
                                                    random_state=0,
                                                    )
# Create the model to fit
bosque = RandomForestClassifier(n_estimators=71,
                                criterion="gini",
                                max_features="sqrt",
                                bootstrap=True,
                                max_samples=2/3,
                                oob_score=True
                                )

bosque.fit(X_train, y_train)
y_pred = bosque.predict(X_test)

r, p = stats.pearsonr(y_pred, y_test)
print(f"Pearson correlation: r={r}, p-value={p}")

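Starting from this code, and thanks to "bootstrap=True", I get a new set of training data and a new coefficient of determination every time I run it.

Can anyone help me loop this code to obtain the maximum value of the coefficient of determination, and save the training data that was used, so that I can build the optimal decision tree?

I have tried running a for loop, but it does not really work. It is as follows: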

for i in range(10000):
    while r < 1:
        Arbol_decisión(X, y)
        r = r
        i = i + 1

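The range used does not represent all the data I have; I need to find the largest possible number of combinations of my data. The letter "r" represents the value of the coefficient of determination. I know the loop I wrote is silly, but the truth is that I cannot think of how to implement it. Can you help me?

Thank you very much for everything.

I tried to run a loop to obtain as many matrices as possible and optimize my decision tree.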

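First, if you are going to approach it like this, you need both a validation set and a test set. Otherwise you will only get biased results, and most likely a model that is overfitted to the test data.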
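As a minimal sketch of what a separate validation and test set could look like, here is one way to do it with the same `train_test_split` used in the question. The stand-in data, the 60/20/20 proportions and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 200 rows, 3 features (like DATA1..DATA3)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=200)

# First set aside 20% as the test set; it is only used once, at the very end
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
# Then split the remainder 75/25: 60% train and 20% validation overall
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```

Any selection between candidate models is then done on the validation set, and the test set is kept for a single final check.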

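Second, if you are just randomly resampling your data (which is what bootstrap does), then all these results tell you is that your dataset is not very good. Ideally, a dataset should represent samples from the underlying distribution, so using more data is better because your model can learn that distribution more effectively. In your case, you are approaching the problem from the angle that some of the data does not represent the underlying distribution (which is why you want to leave it out). If that is the case, you should clean the data properly beforehand. If you cannot find a way to identify these "bad" data points, I would not recommend fiddling with this, because you are just cherry-picking data and producing a bad model.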
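To see why every run with `bootstrap=True` gives a different score, it may help to look at what a single bootstrap draw is. The sketch below draws row indices with replacement and counts how many distinct rows end up in the sample; the sample size and seed are arbitrary assumptions. On average about 63% of the rows appear, and the rest are left out purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # assumed number of rows

# A bootstrap sample: draw n row indices with replacement
idx = rng.integers(0, n, size=n)

# Fraction of distinct rows that made it into the sample
unique_fraction = len(np.unique(idx)) / n
print(f"distinct rows in the bootstrap sample: {unique_fraction:.1%}")
```

Because the omitted rows change at random from draw to draw, a higher score on one draw is mostly sampling noise rather than evidence that those training rows are "better".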

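I would generally recommend that you pause writing code and read more about the theory behind decision trees, random forests and bootstrapping. Otherwise you will probably just end up designing poor ML experiments.

If for some reason you think this is still a good approach (it almost certainly is not), then do the bootstrapping yourself, with something like the code below (there is probably a more optimal solution).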


import numpy as np

# Toy data standing in for the real features and target
X = np.arange(1000)
y = np.arange(1000)/100
# Selecting random train/val/test dataset
# Define slices for 60% train, 20% val, 20% test
train_size = slice(0, int(len(X) * 0.6))
val_size = slice(int(len(X) * 0.6), int(len(X) * 0.8))
test_size = slice(int(len(X) * 0.8), int(len(X) * 1))
# Randomise the indices corresponding to X and y
# (same size so only do once)
rnd_idx = np.random.choice(np.arange(len(X)),
                           len(X),
                           replace=False)
# Loop through the three dataset sizes and select randomised,
# non-overlapping data for them.
X_tr, X_va, X_te = [X[rnd_idx[sliced]] for sliced in [train_size, val_size, test_size]]
y_tr, y_va, y_te = [y[rnd_idx[sliced]] for sliced in [train_size, val_size, test_size]]
###
### Define random forest here
###
# Define the bootstrap size and method
# Here we are sub-selecting 90% of the training data
bootstrap_size = slice(0, int(len(X_tr) * 0.9))
# And using replacement, so can expect ~30% duplicates of data.
replace = True
# Define an acceptable threshold for performance
acceptable_r = 0.9
# Set initial value (non-physically low)
r = -10
# Do a while loop that repeats until the performance is appropriate
while r < acceptable_r:
    # Create randomised indices corresponding to the training set
    rnd_idx2 = np.random.choice(np.arange(len(X_tr)),
                                len(X_tr),
                                replace=replace)
    # Subselect the bootstrapped training data
    X_tr_s = X_tr[rnd_idx2[bootstrap_size]]
    y_tr_s = y_tr[rnd_idx2[bootstrap_size]]
    ###
    ### Fit model here
    ###
    ###
    ### Apply to validation data here
    ###
    ###
    ### Calculate metric here
    ###

    # Placeholder: assign the metric computed above to r;
    # as written, r never changes and the loop never exits
    r = r
###
### Apply to testing data here
###

Once the while loop exits, you can retrieve the corresponding training data, indices, model, and so on.
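For completeness, here is one way the placeholders above might be filled in. This is only a sketch under several assumptions that are not in the original code: the synthetic data, the use of `RandomForestRegressor` (chosen because the target is continuous), the `max_iter` cap that guarantees the loop terminates, and the `best` dictionary used to keep the selected indices and model.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical stand-in data (3 features, continuous target)
X = rng.uniform(0, 1, size=(300, 3))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=300)

# Shuffle once, then split 60/20/20 into train/validation/test indices
perm = rng.permutation(len(X))
n_tr, n_va = int(0.6 * len(X)), int(0.8 * len(X))
tr, va, te = perm[:n_tr], perm[n_tr:n_va], perm[n_va:]

acceptable_r = 0.9  # assumed performance threshold
max_iter = 20       # assumed cap so the loop always terminates
best = {"r": -np.inf, "idx": None, "model": None}

for i in range(max_iter):
    # Bootstrap: draw 90% of the training rows, with replacement
    boot = rng.choice(tr, size=int(0.9 * len(tr)), replace=True)
    model = RandomForestRegressor(n_estimators=50, random_state=i)
    model.fit(X[boot], y[boot])
    # Score on the validation set only, never on the test set
    r, _ = stats.pearsonr(model.predict(X[va]), y[va])
    if r > best["r"]:
        best = {"r": r, "idx": boot, "model": model}
    if best["r"] >= acceptable_r:
        break

# Use the test set exactly once, with the selected model
r_test, _ = stats.pearsonr(best["model"].predict(X[te]), y[te])
print(f"best validation r = {best['r']:.3f}, test r = {r_test:.3f}")
```

`best["idx"]` holds the row indices of the selected bootstrap sample, so the matching training data is `X[best["idx"]]` and `y[best["idx"]]`. The caveats above still apply: picking the luckiest draw tends to overstate performance, which is why the test score is computed only once at the end.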
