I'm trying to build a decision tree model in R, but RStudio freezes when I run the rpart() function. I've provided a link to the dataset I'm using; the code below processes it up to the point where the decision tree model is built. Any help would be appreciated.
https://github.com/ArcanePersona/files/blob/main/vgsales.csv
#Libraries used:
library(tidyverse)
library(Hmisc)
library(mctest)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(missForest)
library(VIM)
library(caret)
library(fmsb)
#Phase One: Data Preprocessing:
#Loading in the "vgsales.csv" data:
game_sales <- read.csv("vgsales.csv", header = T, stringsAsFactors = F)
#turning the structure of the data to tibble for ease of use:
game_sales <- as_tibble(game_sales)
#Replacing the "N/A" character values in Year_of_Release with real NA values:
game_sales %>% filter(game_sales$Year_of_Release == "N/A")
game_sales <- game_sales %>% mutate( Year_of_Release = gsub("N/A","", Year_of_Release))
#Changing the data type of column Year_of_release from "chr" to "int":
game_sales$Year_of_Release <- as.integer(game_sales$Year_of_Release)
str(game_sales$Year_of_Release)
#Imputing Year_of_Release variable and inserting the imputed values:
imputeyear <- with(game_sales,Hmisc::impute(game_sales$Year_of_Release, 'mean'))
game_sales <- game_sales %>% mutate (Year_of_Release = imputeyear)
#Filtering the data for Year_of_Release between 1991 and 2010:
game_sales <- game_sales %>% filter(Year_of_Release >= 1991) %>% filter(Year_of_Release <=2010)
#Creating a subset of non-NA values in the Rating variable,
#because too much of that data is missing (about 50%) to be imputable.
#This subset is for machine learning purposes only:
ml_subset_x <- subset(game_sales, !is.na(game_sales$Critic_Score) | !is.na(game_sales$Critic_Count))
ml_subset_y <- ml_subset_x %>% filter( Rating == "E"| Rating == "M" | Rating =="T" |
Rating == "E10+"| Rating == "AO" | Rating =="K-A" | Rating =="RP")
#Phase Four: Machine Learning:
#Decision Tree:
ml_subset_y$Publisher <- as.factor(ml_subset_y$Publisher)
ml_subset_y$Platform <- as.factor(ml_subset_y$Platform)
ml_subset_y$Genre <- as.factor(ml_subset_y$Genre)
ml_subset_y$Rating <- as.factor(ml_subset_y$Rating)
#Splitting data into train (70%) and test (30%):
set.seed(1234)
index <- sample(nrow(ml_subset_y), 0.7 * nrow(ml_subset_y))
ml_subset_ytrain <- ml_subset_y[index,]
ml_subset_ytest <- ml_subset_y[-index,]
#Modelling the train data using decision tree algorithm:
treemodel <- rpart(Rating~., data=ml_subset_ytrain)
plot(treemodel, margin=0.25)
text(treemodel, use.n=T)
fancyRpartPlot(treemodel)
#Testing the model using the test data and using confusion matrix
#to check Accuracy:
prediction <- predict(treemodel, newdata=ml_subset_ytest, type='class')
accuracy_test <- table(prediction, ml_subset_ytest$Rating)
confusionmatrix(accuracy_test)
There are several issues in your code. I'll try to explain them as clearly as I can.
First, you convert the Year_of_Release column to integer, which is fine. However, your imputeyear variable is a character vector (Hmisc::impute receives the string 'mean' instead of the function mean, so it fills the NAs with the literal text "mean"), and mutate(Year_of_Release = imputeyear) therefore turns the column back into character.
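For illustration, a minimal sketch of the corrected imputation (note that mean is passed as a function, not as the string 'mean'):
#Passing the function mean (not the string "mean") fills the NAs with the
#mean of the observed years and keeps the vector numeric:
imputeyear <- Hmisc::impute(game_sales$Year_of_Release, mean)
class(imputeyear)  #numeric with an extra "impute" class, not character
#as.integer() strips the impute attributes before writing the column back:
game_sales <- game_sales %>%
  mutate(Year_of_Release = as.integer(round(as.numeric(imputeyear))))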
I rewrote the first part of your code below. As you can see, you need to be careful with a few of the variables (the "chr" ones).
Finally, the Name variable should be removed from your set of predictors: there is no point in using it, and the function cannot handle it. I also think the 252 levels of the Publisher variable are probably too many for the rpart algorithm. Remove both and the function works. Alternatively, before converting to a factor you could try filtering the data down to 20-30 different publishers and check whether the function copes with fewer levels; see the sketch below.
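For example, a minimal sketch of that publisher filtering (ml_subset_small and the cutoff of 25 are just illustrative choices):
#Keep only the rows from the 25 publishers with the most games, then
#convert Publisher to a factor with that much smaller set of levels:
top_publishers <- ml_subset_y %>%
  count(Publisher, sort = TRUE) %>%
  slice_head(n = 25) %>%
  pull(Publisher)
ml_subset_small <- ml_subset_y %>%
  filter(Publisher %in% top_publishers) %>%
  mutate(Publisher = factor(Publisher))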
Hope this helps ;)
#Libraries used:
library(tidyverse)
library(Hmisc)
library(mctest)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(missForest)
library(VIM)
library(caret)
library(fmsb)
#Phase One: Data Preprocessing:
#Loading in the "vgsales.csv" data:
#I first converted the "N/A" strings to real NA values to stay close to your
#approach, but you could also use recode() or mutate() with ifelse()
#to do it in one step.
game_sales <- read.csv("vgsales.csv", header = T, stringsAsFactors = F) %>%
as_tibble() %>%
mutate(Year_of_Release = as.integer(na_if(Year_of_Release, "N/A"))) %>%
mutate(Year_of_Release = replace_na(Year_of_Release,
round(mean(.$Year_of_Release[!is.na(.$Year_of_Release)]), 0))) %>%
filter(between(Year_of_Release, 1991, 2010))
str(game_sales) #--> Be careful with Name, Platform, Genre, Publisher and Rating.
#Creating a subset of non-NA values in the Rating variable,
#because too much of that data is missing (about 50%) to be imputable.
#This subset is for machine learning purposes only:
ml_subset_x <- subset(game_sales, !is.na(game_sales$Critic_Score) | !is.na(game_sales$Critic_Count))
ml_subset_y <- ml_subset_x %>% filter( Rating == "E"| Rating == "M" | Rating =="T" |
Rating == "E10+"| Rating == "AO" | Rating =="K-A" | Rating =="RP")
#Phase Four: Machine Learning:
#Decision Tree:
ml_subset_y$Publisher <- as.factor(ml_subset_y$Publisher)
ml_subset_y$Platform <- as.factor(ml_subset_y$Platform)
ml_subset_y$Genre <- as.factor(ml_subset_y$Genre)
ml_subset_y$Rating <- as.factor(ml_subset_y$Rating)
#Splitting data into train (70%) and test (30%):
set.seed(1234)
index <- sample(nrow(ml_subset_y), 0.7 * nrow(ml_subset_y))
ml_subset_ytrain <- ml_subset_y[index,]
ml_subset_ytest <- ml_subset_y[-index,]
str(ml_subset_ytrain) # --> Name should be removed from this table, or the formula made explicit; your choice.
# --> The 252 levels of the publisher variable are problematic.
ml_subset_ytrain <- select(ml_subset_ytrain, -c(Name, Publisher))
#Modelling the train data using decision tree algorithm:
treemodel <- rpart(Rating ~ ., data = ml_subset_ytrain) # Now it works ;)
plot(treemodel, margin=0.25)
text(treemodel, use.n=T)
fancyRpartPlot(treemodel)
#Testing the model using the test data and using confusion matrix
#to check Accuracy:
prediction <- predict(treemodel, newdata=ml_subset_ytest, type='class')
accuracy_test <- table(prediction, ml_subset_ytest$Rating)
confusionMatrix(accuracy_test) # caret's function is confusionMatrix(), note the capital M