Different results from random forest regression in R and Python



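I am running random forest regression on the same data in R and in Python, but the R2 values I get are very different. I know hyperparameters could be behind this, but I would not expect them to nearly halve the R2 score. The code I am using and the results I get are shown below.

In Python: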

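import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# rmse was not defined in the original post; this helper is an assumed definition.
def rmse(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Separate the predictors from the response column.
X = data.drop(['response'], axis=1)
y = data['response']

# Hold out 5% of the rows as a test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)
rdf = RandomForestRegressor(n_estimators=500, oob_score=True)
rdf.fit(X_train, y_train)
# score() returns R^2; it is multiplied by 100 to print a percentage.
print("Random Forest Model Score (on Train)", ":", rdf.score(X_train, y_train) * 100, ",",
      "Random Forest Model Score (on Test)", ":", rdf.score(X_test, y_test) * 100)
y_predicted = rdf.predict(X_train)
y_test_predicted = rdf.predict(X_test)
print("Training RMSE", ":", rmse(y_train, y_predicted),
      "Testing RMSE", ":", rmse(y_test, y_test_predicted))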

>Random Forest Model Score (on Train) : 92.2312123 , Random Forest Model Score (on Test) : 78.1812321
>Training RMSE : 5.606443558164292e-06   Testing RMSE : 9.59221499904858e-06

In R:

> rows <- sample(0.95*nrow(data))
> train_random <- data[rows,]
> test_random <-  data[-rows,]
> rf_model <- randomForest(response ~ . ,
data = train_random,
keep.forest=TRUE,
importance=TRUE
)
> rf_model
Call:
randomForest(formula = response ~ ., data = train_random, keep.forest = TRUE, importance = TRUE) 
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 6
Mean of squared residuals: 1.437236e-06
% Var explained: 42.05
> pred_train <- predict(rf_model,train_random)
> pred_test <- predict(rf_model,test_random)
> R2_Score(pred_train, train_random$response)
[1] 0.9014311
> R2_Score(pred_test, test_random$response)
[1] 0.3616823

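I know the two train/test splits will not contain the same rows, but why are the R2 values so markedly different, and how can I run the same random forest in R? I tried using the same hyperparameters in R that I used in Python, but that did not bring the R2 values into line. Can anyone help?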

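As others have commented, a random forest has a random component, which you probably already know.

Random forests also use bootstrapping: each tree is fit on a different resample of the training rows, so the results can change from one run to the next unless the seed is fixed. I have included a link for further reading. I hope it helps you find the answer you are looking for.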

https://stats.stackexchange.com/questions/120446/different-results-from-several-passes-of-random-forest-on-same-dataset
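
One way to see how much of a gap the random component alone can account for is to refit the same forest under several seeds and compare the scores. The sketch below does this on synthetic data from `make_regression`, which stands in for the real data set and is an assumption, as are the sample size, the seed range, and the helper name `fit_and_score`. It also reports scikit-learn's out-of-bag R2 (`oob_score_`), since R's "% Var explained" is computed from out-of-bag predictions and so is closer to that figure than to the in-sample training score.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the asker's data (an assumption; the real data is not shown).
X, y = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def fit_and_score(seed):
    """Fit one forest with the given seed; return (OOB R^2, test R^2)."""
    rf = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=seed)
    rf.fit(X_train, y_train)
    return rf.oob_score_, rf.score(X_test, y_test)


# Same data and same split; only the forest's seed changes between fits.
scores = np.array([fit_and_score(seed) for seed in range(5)])
oob_r2, test_r2 = scores[:, 0], scores[:, 1]

print("OOB R^2 per seed: ", np.round(oob_r2, 4))
print("Test R^2 per seed:", np.round(test_r2, 4))
print("Spread of test R^2 across seeds:", test_r2.max() - test_r2.min())

# Refitting with an identical seed rebuilds the same forest, so the scores match.
repeat_a = fit_and_score(0)
repeat_b = fit_and_score(0)
print("Reproducible with a fixed seed:", repeat_a == repeat_b)
```

If the spread across seeds turns out to be small compared with the gap you are seeing, bootstrap randomness alone is unlikely to explain it, and the way the data were split or the two libraries' default settings would be worth checking next.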
