Decision boundary in ROC analysis

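I am new to machine learning, and I came across SIAMCAT, a very nice package for LASSO regression.

The publication that uses this package, "Potential of fecal microbiota for early-stage detection of colorectal cancer" (http://europepmc.org/article/MED/25432777), is well written, so I could easily get into the rather complex machine learning methods. However, there is one thing about their approach that I cannot figure out, so I am politely asking for help.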

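My question is about the value of the decision boundary. The authors state that their decision boundary is 0.275 (Fig. 1A, 2A). However, I cannot tell where this value comes from.

If the authors could answer this, I would like to know how the value of 0.275 was derived.

Thank you very much.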

The decision boundary is based on the model predictions. In SIAMCAT, you can access the model predictions with the pred_matrix accessor:

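# load packages
library("tidyverse")
library("SIAMCAT")
# load the example SIAMCAT object and the metadata shipped with the package
data("siamcat_example", package = "SIAMCAT")
data("meta_crc_zeller", package = "SIAMCAT")
# extract the mean predictions (averaged over repeats) as data.frame
pred <- enframe(rowMeans(pred_matrix(siamcat_example)),
                name = 'Sample_ID', value = 'prediction')
# get the metadata information for the same dataset also as data.frame
meta.data <- as_tibble(meta.crc.zeller, rownames = 'Sample_ID')
# join the two data.frames together on the sample identifier
df.plot <- full_join(pred, meta.data, by = 'Sample_ID')
# compare the predictions between the two groups
df.plot %>%
  ggplot(aes(x = Group, y = prediction, fill = Group)) +
    geom_boxplot()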

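If you now pick an arbitrary cutoff for the predictions, you can call every sample above the cutoff positive and every sample below it negative. This gives you the false positive rate and the true positive rate for that cutoff: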

cutoff <- 0.55
df.plot %>%
  mutate(positive = prediction > cutoff) %>%
  group_by(Group) %>%
  summarise(rate = sum(positive) / n())

# A tibble: 2 x 2
  Group  rate
  <fct> <dbl>
1 CRC   0.585
2 CTR   0.148

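Here, the false positive rate is 0.148 (the fraction of CTR samples above the cutoff) and the true positive rate is 0.585 (the fraction of CRC samples above the cutoff).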

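To compute the AUROC, you essentially go through every possible prediction cutoff, or decision boundary, and record the true positive rate and the false positive rate at each one. Plotting these two vectors against each other gives the ROC curve.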
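The sweep over cutoffs can be sketched in a few lines of base R. This is only an illustration on made-up toy data (the vectors `prediction` and `label` are invented here and are not from the Zeller dataset), not the way SIAMCAT computes the curve internally:

```r
# Toy data (made up for illustration): model predictions and true labels
prediction <- c(0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95)
label <- c("CTR", "CTR", "CTR", "CRC", "CTR", "CRC", "CTR", "CRC", "CRC", "CRC")

# Candidate cutoffs: every observed prediction, plus -Inf and Inf so that
# the curve starts at (1, 1) and ends at (0, 0)
cutoffs <- c(-Inf, sort(unique(prediction)), Inf)

# For each cutoff, call everything above it positive and record both rates
tpr <- sapply(cutoffs, function(ct) mean(prediction[label == "CRC"] > ct))
fpr <- sapply(cutoffs, function(ct) mean(prediction[label == "CTR"] > ct))

# Plot the two vectors against each other to get the ROC curve
plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)

# Area under the curve by the trapezoidal rule,
# with the points ordered by increasing false positive rate
ord <- order(fpr, tpr)
auroc <- sum(diff(fpr[ord]) * (head(tpr[ord], -1) + tail(tpr[ord], -1)) / 2)
auroc
```

With these toy values, 22 of the 25 possible CRC/CTR pairs are ranked correctly, so the area comes out at 0.88.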

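In the paper you mention (Zeller et al.), the authors chose a specific false positive rate at which to evaluate their model. They therefore checked which decision boundary yields that false positive rate.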

This information is stored in the roc object within the evaluation data of the SIAMCAT object:

# get the roc curve out of the siamcat object
roc.all <- eval_data(siamcat_example)$roc
# the roc object contains all decision boundaries and the resulting
# sensitivities/specificities
# if we are interested in a false positive rate of 10%, this means
# we have to find the decision boundary corresponding to a specificity of 90%
# (specificity = 1 - fpr)
# thresholds are sorted in increasing order, so the first hit is the lowest
# boundary with a specificity above 90%
idx <- which(roc.all$specificities > 0.90)[1]
boundary <- roc.all$thresholds[idx]
# generate a similar plot as figure 2a in zeller et al.
df.plot %>%
  # sort the samples by predictions and turn them into a factor to preserve
  # the order
  arrange(prediction) %>%
  mutate(Sample_ID = factor(Sample_ID, levels = Sample_ID)) %>%
  # then, generate the relative rank within each group
  group_by(Group) %>%
  mutate(rel_rank = seq_len(n())) %>%
  mutate(rel_rank = rel_rank / n()) %>%
  ungroup() %>%
  # additionally, check if the samples are above or below the decision boundary
  mutate(predicted_positive = prediction > boundary) %>%
  # turn the group variable into a factor as well
  mutate(Group = factor(Group, levels = c('CTR', 'CRC'))) %>%
  ggplot(aes(x = rel_rank, y = prediction, col = predicted_positive)) +
    geom_hline(yintercept = boundary) +
    geom_point() +
    facet_grid(~Group, scales = 'free', space = 'free') +
    xlab('Relative rank within the dataset') +
    ylab('Model prediction') +
    theme_bw() +
    theme(panel.grid = element_blank())
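To see why a chosen false positive rate pins down one boundary, note that the false positive rate depends only on the predictions of the control samples. The sketch below uses made-up control predictions (`ctr_pred` is invented here and is not from the Zeller dataset). It takes midpoints between neighbouring predictions as candidate boundaries and accepts a false positive rate equal to the target, which is slightly more permissive than the strict comparison in the lookup above:

```r
# Toy predictions for control samples only (made up for illustration)
ctr_pred <- c(0.05, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20, 0.22, 0.30, 0.45)
target_fpr <- 0.10

# Candidate boundaries: midpoints between neighbouring sorted predictions
s <- sort(unique(ctr_pred))
candidates <- (head(s, -1) + tail(s, -1)) / 2

# False positive rate at each candidate: fraction of controls above it
fpr <- sapply(candidates, function(b) mean(ctr_pred > b))

# Lowest boundary whose false positive rate does not exceed the target
boundary <- min(candidates[fpr <= target_fpr])
boundary
mean(ctr_pred > boundary)
```

Here only one of the ten controls lies above the selected boundary, so the false positive rate is exactly 10%. The value 0.275 in the paper plays the same role for the authors' chosen false positive rate on their own data.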
