R - Logit model output is incorrect



I am running a logit model on data that I found on Kaggle:

https://www.kaggle.com/datasets/leonardopena/top50spotify2019

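My goal is to predict which songs will become international hits (TRUE). However, the model seems to predict the songs that will not become international hits (FALSE).

Can anyone explain why the model predicts FALSE rather than TRUE? Thanks for your help.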

structure(list(bpm = c(105L, 170L, 120L, 87L, 129L, 125L),
               nrgy = c(72L, 71L, 42L, 38L, 71L, 94L),
               dnce = c(72L, 74L, 75L, 72L, 58L, 74L),
               dB = c(-7L, -4L, -8L, -8L, -8L, -1L),
               hit = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)),
          row.names = c(8L, 80L, 15L, 361L, 42L, 185L),
          class = "data.frame")
# Load the prepared data set
dfTop50 <- read.csv("SpotifyTop50country_prepared.csv",
                    row.names = 1, stringsAsFactors = FALSE)

# 70/30 train/test split
train  <- 0.7
nCases <- nrow(dfTop50)
set.seed(123)
trainCases   <- sample(1:nCases, floor(train * nCases))
dfTop50Train <- dfTop50[trainCases, ]
dfTop50Test  <- dfTop50[-trainCases, ]

# Model formula
mdlA <- hit ~ bpm + nrgy + dnce + dB
str(mdlA)

# Fit the logit model on the training set and predict probabilities for the test set
rsltLogit <- glm(mdlA, data = dfTop50Train, family = binomial("logit"))
predLogit <- predict(rsltLogit, dfTop50Test, type = "response")

head(cbind(Observed = dfTop50Test$hit, Predicted = predLogit))

# Classify at a 0.5 probability threshold
predLogit <- factor(as.numeric(predLogit > 0.5),
                    levels = c(0, 1),
                    labels = c("FALSE", "TRUE"))
accLogit <- mean(predLogit == dfTop50Test$hit)
describe(accLogit)  # describe() is not base R; it needs a package such as psych or Hmisc
tblLog <- table(Predicted = predLogit,
                Observed = dfTop50Test$hit)
View(tblLog)

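Without your test/training data, or the code that converts the Kaggle data into the same format, it is hard to be sure. However, using the small sample of data at the top of your code, it is clear that a GLM fitted to those rows predicts hit = TRUE. Note that the predicted probability (pr in the output below) is approximately 1 for the hits and approximately 0 for the non-hits.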

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
dat <- structure(list(bpm = c(105L, 170L, 120L, 87L, 129L, 125L),
                      nrgy = c(72L, 71L, 42L, 38L, 71L, 94L),
                      dnce = c(72L, 74L, 75L, 72L, 58L, 74L),
                      dB = c(-7L, -4L, -8L, -8L, -8L, -1L),
                      hit = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)),
                 row.names = c(8L, 80L, 15L, 361L, 42L, 185L),
                 class = "data.frame")

g <- glm(hit ~ bpm + nrgy + dnce + dB, data=dat, family=binomial)
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
broom::augment(g) %>%
  select(hit, .fitted) %>%
  mutate(pr = plogis(.fitted))
#> # A tibble: 6 × 3
#>   hit   .fitted       pr
#>   <lgl>   <dbl>    <dbl>
#> 1 TRUE     25.7 1.00e+ 0
#> 2 TRUE     39.1 1   e+ 0
#> 3 TRUE     23.4 1.00e+ 0
#> 4 FALSE   -23.7 4.95e-11
#> 5 TRUE     24.2 1.00e+ 0
#> 6 FALSE   -25.9 5.87e-12

Created on 2022-04-18 by the reprex package (v2.0.1)
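As a further check, the classification step can be sketched in base R on the same six sample rows. This is only an illustration under stated assumptions, not a diagnosis of the full data set: the names fit, prob, predHit, baseline and confusion are made up for this sketch, and the 0.5 threshold is taken from the question. It keeps the predictions logical so that they compare directly with the hit column, and it computes the majority-class share as a reference point. If hits are rare in the real training data, predicted probabilities may stay below 0.5 and every song would be classified as FALSE.

```r
# Sample rows copied from the question (not the full Kaggle data set).
dat <- data.frame(
  bpm  = c(105L, 170L, 120L, 87L, 129L, 125L),
  nrgy = c(72L, 71L, 42L, 38L, 71L, 94L),
  dnce = c(72L, 74L, 75L, 72L, 58L, 74L),
  dB   = c(-7L, -4L, -8L, -8L, -8L, -1L),
  hit  = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

# With only six rows the classes are perfectly separated, so glm() warns.
# suppressWarnings() is used here only to keep the illustration quiet.
fit <- suppressWarnings(
  glm(hit ~ bpm + nrgy + dnce + dB, data = dat, family = binomial)
)

# type = "response" returns P(hit = TRUE): for a logical response,
# glm() treats TRUE as the "success" outcome.
prob <- predict(fit, newdata = dat, type = "response")

# Threshold at 0.5 and keep the prediction logical, like the observed column.
predHit <- prob > 0.5

# Accuracy obtained by always predicting the majority class.
baseline <- max(mean(dat$hit), 1 - mean(dat$hit))

# Share of rows where prediction and observation agree.
accuracy <- mean(predHit == dat$hit)

# Fix both levels so the table stays 2 x 2 even if one class is never predicted.
confusion <- table(
  Predicted = factor(predHit, levels = c(FALSE, TRUE)),
  Observed  = factor(dat$hit, levels = c(FALSE, TRUE))
)
print(confusion)
```

Running the same steps on dfTop50Train and dfTop50Test, and comparing accuracy against baseline, would show whether the model on the full data is doing anything beyond predicting the majority class.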
