Median-split and quartile-split columns in a data frame, by group (R)



Taking mtcars as an example:

mtcars <- subset(mtcars, select = c("cyl", "disp"))

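How can I add two extra columns, one indicating whether a value is below or above the median, and another indicating which quartile the value falls in? I want this done separately for each cyl group.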

Here is the specific result I am hoping for:

                  cyl  disp    median_split    quartile_split
Toyota Corolla    4    71.1    below_median    1st_quartile
Honda Civic       4    75.7    below_median    1st_quartile
Fiat 128          4    78.7    below_median    1st_quartile
Fiat X1-9         4    79      below_median    2nd_quartile
Lotus Europa      4    95.1    below_median    2nd_quartile
Datsun 710        4    108     median          median
Toyota Corona     4    120.1   above_median    3rd_quartile
Porsche 914-2     4    120.3   above_median    3rd_quartile
Volvo 142E        4    121     above_median    4th_quartile
Merc 230          4    140.8   above_median    4th_quartile
Merc 240D         4    146.7   above_median    4th_quartile
Ferrari Dino      6    145     below_median    1st_quartile
Mazda RX4         6    160     etc…            etc…

Any help would be appreciated. Thanks.

Edit, following akrun's answer below

In the quartile_split column, akrun's answer leaves NA for the lowest value in each cyl group. I think I can work around this by adding the following:

mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution

So the full code would be:

library(dplyr)
mtcars <- subset(mtcars, select = c("cyl", "disp"))
# akrun's answer
mtcars <- mtcars %>%
  group_by(cyl) %>%
  mutate(median_split = c("above_median", "below_median")[1 + (disp <= median(disp))],
         quartile_split = cut(disp, breaks = quantile(disp),
                              labels = paste0(1:4, "_quartile")))
# addition
mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution

However, when I looked more closely, I found something else that does not seem right. Specifically, when you look only at the cyl = 6 group, you see this:

cyl  disp      median_split    quartile_split
6    145       below_median    1_quartile
6    160       below_median    1_quartile
6    160       below_median    1_quartile
6    167.6     below_median    2_quartile
6    167.6     below_median    2_quartile
6    225       above_median    4_quartile
6    258       above_median    4_quartile

The median disp in this group is 163.8, so the two cars with disp = 167.6 should be classified as "above_median", not "below_median".

I hope this can be fixed somehow. Thanks again.

One option is to group by "cyl" and use cut, with breaks taken from quantile of the "disp" column, to create the categories:

library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(median_split = c("above_median", "below_median")[1 + (disp <= median(disp))],
         quartile_split = cut(disp, breaks = quantile(disp),
                              labels = paste0(1:4, "_quartile")))
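The NA reported in the question's edit comes from cut using left-open intervals by default, so the group minimum falls outside the first bin. A minimal sketch of one way around it, assuming dplyr is installed, is to pass include.lowest = TRUE. The same sketch uses case_when to give values equal to the group median their own "median" label, as in the desired output (median() returns 167.6 for the seven cyl = 6 rows, so those two cars get that label). The names dat and result are only illustrative.

```r
library(dplyr)

# `dat` and `result` are illustrative names, not part of the original answer
dat <- subset(mtcars, select = c("cyl", "disp"))

result <- dat %>%
  group_by(cyl) %>%
  mutate(
    # three-way split: a value equal to the group median gets its own label
    median_split = case_when(
      disp < median(disp) ~ "below_median",
      disp > median(disp) ~ "above_median",
      TRUE                ~ "median"
    ),
    # include.lowest = TRUE closes the first interval on the left,
    # so the group minimum lands in the 1st quartile instead of becoming NA
    quartile_split = cut(disp,
                         breaks = quantile(disp),
                         labels = paste0(1:4, "_quartile"),
                         include.lowest = TRUE)
  ) %>%
  ungroup()
```

One caveat: if a group has so many tied values that quantile returns duplicate breaks, cut stops with an error, so this sketch assumes the breaks within each group are distinct (which holds for mtcars).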

Using base R and cut:

mtcars <- subset(mtcars, select = c("cyl", "disp"))
mtcars$median_split <- ifelse(mtcars$disp <= median(mtcars$disp), "below_median","above_median")
mtcars$quantile_split <- cut(mtcars$disp, breaks = c(0, quantile(mtcars$disp)),labels = c("1_quartile",paste0(1:4, "_quartile")))

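Be careful when using the cut function: make sure the breaks include the minimum value (otherwise it returns NA), and that the minimum value is labeled as belonging to the 1st quartile.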
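The base R code above works on the whole data frame rather than on each cyl group. A per-group variant could be sketched with ave, which applies a function within groups and returns a vector aligned with the original rows. This is only one possible approach; the names dat, grp_median and quartile_index are illustrative.

```r
# `dat`, `grp_median` and `quartile_index` are illustrative names
dat <- subset(mtcars, select = c("cyl", "disp"))

# Median of disp within each cyl group, repeated for every row of that group
grp_median <- ave(dat$disp, dat$cyl, FUN = median)
dat$median_split <- ifelse(dat$disp <= grp_median, "below_median", "above_median")

# Quartile index (1 to 4) within each cyl group;
# labels = FALSE returns integer codes, and include.lowest = TRUE
# keeps the group minimum in bin 1 instead of producing NA
quartile_index <- ave(dat$disp, dat$cyl, FUN = function(x) {
  cut(x, breaks = quantile(x), labels = FALSE, include.lowest = TRUE)
})
dat$quartile_split <- paste0(quartile_index, "_quartile")
```

Unlike the dplyr pipeline, this keeps the car names as row names, so a single car can be inspected with, for example, dat["Toyota Corolla", ].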
