R - How to programmatically modify interval factor levels for nicer display



Suppose you have data like this:

df <- data.frame(income = rnorm(1000,77345,30569))

Add an interval factor column that indicates which quartile each observation falls into:

df$quant <- cut(df$income, quantile(df$income))

The factor levels look like this:

Levels: (-4.48e+04,5.6e+04] (5.6e+04,7.69e+04] (7.69e+04,9.73e+04] (9.73e+04,1.64e+05]

How can I change the intervals programmatically, rather than by hand, so that they print nicely in a frequency summary table?

library(dplyr)
df %>% count(quant)

This prints as follows:

                quant   n
1 (-4.48e+04,5.6e+04] 249
2  (5.6e+04,7.69e+04] 250
3 (7.69e+04,9.73e+04] 250
4 (9.73e+04,1.64e+05] 250

I would like it to look like this:

               quant   n
1  ($44,800,$56,000] 249
2  ($56,000,$76,900] 250
3  ($76,900,$97,300] 250
4 ($97,300,$164,000] 250

This is for printing purposes only (in a markdown report). I have already done all the calculations and plots without any problems.

cut2 from the Hmisc package accepts a formatfun argument:

library(Hmisc)
library(scales)
df$quant2 <-  cut2(df$income,digits = 5, cuts = quantile(df$income), 
formatfun = function(x) paste0("$", comma(x)), onlycuts = TRUE)

Output:

> head(df)
     income             quant2               quant
1  60657.97  [$55,485,$76,547) (5.55e+04,7.65e+04]
2  93747.88  [$76,547,$96,620) (7.65e+04,9.66e+04]
3  90172.46  [$76,547,$96,620) (7.65e+04,9.66e+04]
4  59504.10  [$55,485,$76,547) (5.55e+04,7.65e+04]
5 103251.01 [$96,620,$178,251] (9.66e+04,1.78e+05]
6  85477.03  [$76,547,$96,620) (7.65e+04,9.66e+04]

If we want to modify the result of the original cut instead:

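library(dplyr)
library(tidyr)
library(stringr)
df <- df %>%
  # strip the "(" and "]" delimiters from each interval label
  mutate(quant = str_remove_all(quant, "\\(|\\]")) %>%
  # split into lower and upper bounds; convert = TRUE turns them into numbers
  separate(quant, into = c('q1', 'q2'), sep = ",", convert = TRUE) %>%
  # format both bounds with scales::dollar() (scales is loaded above), then rebuild the label
  mutate(across(q1:q2, ~ dollar(.x)),
         quant = glue::glue("({q1},{q2}]"), q1 = NULL, q2 = NULL)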

Output:

> head(df)
     income              quant
1  60657.97  ($55,500,$76,500]
2  93747.88  ($76,500,$96,600]
3  90172.46  ($76,500,$96,600]
4  59504.10  ($55,500,$76,500]
5 103251.01 ($96,600,$178,000]
6  85477.03  ($76,500,$96,600]
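
One side effect of the pipeline above is that quant ends up as a character (glue) column rather than a factor, so the level ordering is no longer stored. If the factor should be kept, a possible alternative is to rewrite only its levels. The following is a minimal base R sketch, not part of the original answer: relabel_dollar is a hypothetical helper name, and it assumes every level has the "(lower,upper]" shape that cut() produces by default.

```r
# Hypothetical helper: turn levels such as "(5.55e+04,7.65e+04]" into
# "($55,500,$76,500]" while keeping the object a factor.
relabel_dollar <- function(f) {
  lv <- levels(f)
  open_br  <- substr(lv, 1, 1)                   # "(" or "["
  close_br <- substr(lv, nchar(lv), nchar(lv))   # "]" or ")"
  inner    <- substr(lv, 2, nchar(lv) - 1)       # "lower,upper"
  bounds   <- strsplit(inner, ",", fixed = TRUE)
  lower <- as.numeric(vapply(bounds, `[`, character(1), 1))
  upper <- as.numeric(vapply(bounds, `[`, character(1), 2))
  # Format a number as whole dollars with thousands separators.
  money <- function(x) {
    paste0(ifelse(x < 0, "-", ""), "$",
           formatC(abs(x), format = "f", digits = 0, big.mark = ","))
  }
  levels(f) <- paste0(open_br, money(lower), ",", money(upper), close_br)
  f
}

set.seed(1)
df <- data.frame(income = rnorm(1000, 77345, 30569))
df$quant <- cut(df$income, quantile(df$income))
df$quant <- relabel_dollar(df$quant)
levels(df$quant)
```

Because only the levels are touched, the work is done once per interval instead of once per row, and tables and plots keep the intervals in their original order.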

Here is another solution:

q <- quantile(df$income)
qlbls <- sapply(1:4,function(i) paste0("(",scales::dollar(q[i]),",",scales::dollar(q[i+1]),"]"))
df$quant <- cut(df$income, q, labels = qlbls)

Output:

> head(df)
     income                   quant
1  43842.61  ($29,745.22,$56,569.49]
2  73176.84  ($56,569.49,$76,945.41]
3  85658.10  ($76,945.41,$99,013.99]
4  84613.72  ($76,945.41,$99,013.99]
5 130301.96    ($99,013.99,$179,552]
6  61917.61  ($56,569.49,$76,945.41]

This solution is very flexible: you can make the factor labels look however you like.
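
To illustrate that flexibility, here is a sketch of the same idea wrapped in a function. It is only an illustration under stated assumptions: cut_dollar and its round_to argument are hypothetical names, base R formatC() is swapped in for scales::dollar() so the block has no package dependencies, and include.lowest = TRUE is added so that the minimum value is not turned into NA (with the default, cut() leaves the smallest observation outside the first interval).

```r
# Hypothetical helper: quantile-based factor with dollar labels.
cut_dollar <- function(x, probs = seq(0, 1, 0.25), round_to = 100) {
  q <- quantile(x, probs, na.rm = TRUE)
  # Round only the values shown in the labels; the exact quantiles stay the breaks.
  shown <- round(q / round_to) * round_to
  money <- paste0(ifelse(shown < 0, "-", ""), "$",
                  formatC(abs(shown), format = "f", digits = 0, big.mark = ","))
  n <- length(q)
  labels <- paste0("(", money[-n], ",", money[-1], "]")
  # The first interval is closed on the left because include.lowest = TRUE.
  labels[1] <- sub("(", "[", labels[1], fixed = TRUE)
  cut(x, q, labels = labels, include.lowest = TRUE)
}

set.seed(1)
df <- data.frame(income = rnorm(1000, 77345, 30569))
df$quant <- cut_dollar(df$income)
table(df$quant, useNA = "ifany")
```

Passing a different round_to (for example 1 or 1000) changes how coarse the printed bounds are without moving any observation to another interval.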

chop_quantiles() from my santoku package works well for this:

library(santoku)
d <- rnorm(10)
# labelled by the quantiles:
chop_quantiles(d, c(0.1, 0.5, 0.9))
[1] [0%, 10%)   [10%, 50%)  [50%, 90%]  [10%, 50%)  [10%, 50%)  (90%, 100%]
[7] [50%, 90%]  [10%, 50%)  [50%, 90%]  [50%, 90%] 
Levels: [0%, 10%) [10%, 50%) [50%, 90%] (90%, 100%]
# by the raw values:
chop_quantiles(d, c(0.1, 0.5, 0.9), labels = lbl_intervals(raw = TRUE))
[1] [-2.515, -1.633)  [-1.633, -0.8172) [-0.8172, 0.3274] [-1.633, -0.8172)
[5] [-1.633, -0.8172) (0.3274, 0.4165]  [-0.8172, 0.3274] [-1.633, -0.8172)
[9] [-0.8172, 0.3274] [-0.8172, 0.3274]
4 Levels: [-2.515, -1.633) [-1.633, -0.8172) ... (0.3274, 0.4165]
# format string passed to sprintf():
chop_quantiles(d, c(0.1, 0.5, 0.9), labels = lbl_intervals(raw = TRUE, fmt = "%.2f"))
[1] [-2.52, -1.63) [-1.63, -0.82) [-0.82, 0.33]  [-1.63, -0.82) [-1.63, -0.82)
[6] (0.33, 0.42]   [-0.82, 0.33]  [-1.63, -0.82) [-0.82, 0.33]  [-0.82, 0.33] 
Levels: [-2.52, -1.63) [-1.63, -0.82) [-0.82, 0.33] (0.33, 0.42]
# different kinds of labels:
chop_quantiles(d, c(0.1, 0.5, 0.9), labels = lbl_dash(" - ", raw = TRUE))
[1] -2.515 - -1.633  -1.633 - -0.8172 -0.8172 - 0.3274 -1.633 - -0.8172
[5] -1.633 - -0.8172 0.3274 - 0.4165  -0.8172 - 0.3274 -1.633 - -0.8172
[9] -0.8172 - 0.3274 -0.8172 - 0.3274
4 Levels: -2.515 - -1.633 -1.633 - -0.8172 ... 0.3274 - 0.4165
# make your own:
chop_quantiles(d, c(0.1, 0.5, 0.9), labels = lbl_glue("{l} to {r}", raw = TRUE))
[1] -2.515 to -1.633  -1.633 to -0.8172 -0.8172 to 0.3274 -1.633 to -0.8172
[5] -1.633 to -0.8172 0.3274 to 0.4165  -0.8172 to 0.3274 -1.633 to -0.8172
[9] -0.8172 to 0.3274 -0.8172 to 0.3274
