Taking mtcars as an example:
mtcars <- subset(mtcars, select = c("cyl", "disp"))
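How can I add two extra columns, one indicating whether each value is below or above the median, and another indicating which quartile the value falls in? I want this done separately within each cyl group.
This is the specific result I am hoping for: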
                cyl  disp   median_split  quartile_split
Toyota Corolla  4    71.1   below_median  1st_quartile
Honda Civic     4    75.7   below_median  1st_quartile
Fiat 128        4    78.7   below_median  1st_quartile
Fiat X1-9       4    79     below_median  2nd_quartile
Lotus Europa    4    95.1   below_median  2nd_quartile
Datsun 710      4    108    median        median
Toyota Corona   4    120.1  above_median  3rd_quartile
Porsche 914-2   4    120.3  above_median  3rd_quartile
Volvo 142E      4    121    above_median  4th_quartile
Merc 230        4    140.8  above_median  4th_quartile
Merc 240D       4    146.7  above_median  4th_quartile
Ferrari Dino    6    145    below_median  1st_quartile
Mazda RX4       6    160    etc…          etc…
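Any help would be appreciated. Thanks.
Edit, following akrun's answer below:
In the quartile_split column, akrun's answer leaves the lowest value in each cyl group as NA. I think I can work around this by adding the following: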
mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution
So the full code would be:
library(dplyr)
mtcars <- subset(mtcars, select = c("cyl", "disp"))
# akrun's answer
mtcars <- mtcars %>%
  group_by(cyl) %>%
  mutate(median_split = c("above_median", "below_median")[1 +
                          (disp <= median(disp))],
         quartile_split = cut(disp, breaks = quantile(disp),
                              labels = paste0(1:4, "_quartile")))
# addition
mtcars$quartile_split[is.na(mtcars$quartile_split)] <- "1_quartile" #not a very elegant solution
However, on closer inspection I found something else that does not look quite right. Specifically, when you look only at the cyl = 6 group, you see this:
cyl  disp   median_split  quartile_split
6    145    below_median  1_quartile
6    160    below_median  1_quartile
6    160    below_median  1_quartile
6    167.6  below_median  2_quartile
6    167.6  below_median  2_quartile
6    225    above_median  4_quartile
6    258    above_median  4_quartile
The median disp in this group is 163.8, so the two cars with disp = 167.6 should be classified as "above_median", not "below_median".
I hope this can be fixed somehow. Thanks again.
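The median quoted above can be checked directly. The following is a minimal sketch in base R, assuming the stock mtcars dataset that ships with R; the names `cars`, `six` and `group_medians` are illustrative only. The cyl = 6 group contains seven cars, so median() returns the fourth sorted value, 167.6, rather than 163.8 (which is the midpoint of 160 and 167.6). Under the test `disp <= median(disp)`, the two cars at 167.6 sit exactly on the median and are therefore labelled "below_median".

```r
# Keep only the two columns used in the question
cars <- subset(mtcars, select = c("cyl", "disp"))

# Sorted disp values for the cyl = 6 group (seven cars)
six <- sort(cars$disp[cars$cyl == 6])
print(six)

# With an odd number of values, median() returns the middle one
print(median(six))

# Medians for all cyl groups at once
group_medians <- tapply(cars$disp, cars$cyl, median)
print(group_medians)
```

If values equal to the median should get their own label, as in the desired output for Datsun 710, the comparison needs three outcomes rather than two.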
One option is to group by "cyl" and use cut, with breaks taken from the quantile of the "disp" column, to create the categories:
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  mutate(median_split = c("above_median", "below_median")[1 +
                          (disp <= median(disp))],
         quartile_split = cut(disp, breaks = quantile(disp),
                              labels = paste0(1:4, "_quartile")))
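The pipeline above leaves the minimum of each group as NA, because cut() uses intervals that are open on the left by default. One possible variant, offered as a sketch rather than as the answerer's own code, passes `include.lowest = TRUE` to cut() and uses dplyr's case_when() for a three-way median label. The name `result` and the extra "median" label are assumptions made to match the output requested in the question.

```r
library(dplyr)

result <- mtcars %>%
  select(cyl, disp) %>%
  group_by(cyl) %>%
  mutate(
    # Three-way split: values equal to the group median get their own label
    median_split = case_when(
      disp < median(disp) ~ "below_median",
      disp > median(disp) ~ "above_median",
      TRUE                ~ "median"
    ),
    # include.lowest = TRUE closes the first interval on the left,
    # so the group minimum lands in the first quartile instead of NA
    quartile_split = cut(disp,
                         breaks = quantile(disp),
                         labels = paste0(1:4, "_quartile"),
                         include.lowest = TRUE)
  ) %>%
  ungroup() %>%
  arrange(cyl, disp)

print(result, n = 32)
```

Note that group_by() returns a tibble, so the car names stored as row names are not carried through. If they are needed, they would have to be copied into a regular column before grouping.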
Using base R and cut:
mtcars <- subset(mtcars, select = c("cyl", "disp"))
mtcars$median_split <- ifelse(mtcars$disp <= median(mtcars$disp), "below_median","above_median")
mtcars$quantile_split <- cut(mtcars$disp, breaks = c(0, quantile(mtcars$disp)),labels = c("1_quartile",paste0(1:4, "_quartile")))
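Be careful when using the cut function: by default its intervals are open on the left, so a value equal to the lowest break is returned as NA. Make sure the breaks cover the minimum value and that the minimum is labelled as the 1st quartile.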
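The base R snippet above works on the whole disp column, while the question asks for a split within each cyl group. A per-group version can be sketched with ave(), which applies a function within groups and returns the results in the original row order. The helper `quartile_index` and the variables `cars`, `idx` and `grp_median` are hypothetical names introduced for this illustration, and `include.lowest = TRUE` is used here in place of the extra 0 break.

```r
cars <- subset(mtcars, select = c("cyl", "disp"))

# Hypothetical helper: quartile index (1 to 4) of each value within x.
# labels = FALSE makes cut() return integer codes instead of a factor.
quartile_index <- function(x) {
  cut(x, breaks = quantile(x), labels = FALSE, include.lowest = TRUE)
}

# ave() applies the helper within each cyl group and keeps the row order
idx <- ave(cars$disp, cars$cyl, FUN = quartile_index)
cars$quartile_split <- paste0(idx, "_quartile")

# Per-group median, repeated for every row of its group
grp_median <- ave(cars$disp, cars$cyl, FUN = median)
cars$median_split <- ifelse(cars$disp <= grp_median,
                            "below_median", "above_median")

print(cars[order(cars$cyl, cars$disp), ])
```

Because subset() keeps the row names of a data frame, the car names remain available in this version.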