R: Trouble calculating percentiles?

I am working with the R programming language.

I have the following dataset:

library(dplyr)
set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))
my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))

I then use this function to calculate arbitrary percentiles of a variable:

# source: https://stackoverflow.com/questions/74947154/r-using-dplyr-to-perform-conditional-functions
ptile <- function(x, n_percentiles) {
# Calculate the percentiles
pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
# Create a character vector to store the labels
labels <- sprintf("%.2f to %.2f percentile %d",
head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

Sometimes, when I use this function:

# error not produced on this dataset, but on other datasets
na.omit(my_data) %>%
group_by(gender, status, country) %>%
mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))

I get one of two errors:

  • Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : invalid number of intervals

  • Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : 'breaks' are not unique

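At first, I thought these errors were produced because I am applying this function to grouped rows. Perhaps some of those groups have too few rows to compute the requested percentiles?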
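The small-group hypothesis can be checked directly. The following is a minimal sketch, assuming a hypothetical group that contains a single observation (the value 5 is arbitrary): every quantile collapses to the same number, so the breaks passed to cut() are duplicated.

```r
# Hypothetical group with a single observation (value chosen arbitrarily)
x <- 5
n_percentiles <- 10

# Same steps as inside ptile()
pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
labels <- sprintf("%.2f to %.2f percentile %d",
                  head(pct, -1), tail(pct, -1), seq_len(n_percentiles))

# All 11 break points are identical, so only one distinct value remains
n_unique_breaks <- length(unique(pct))

# cut() refuses duplicated breaks; capture the message instead of stopping
msg <- tryCatch({
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
  NA_character_
}, error = function(e) conditionMessage(e))

print(n_unique_breaks)
print(msg)
```

A group does not need to be this small to fail: any group whose values are heavily tied can also yield repeated quantiles and therefore duplicated breaks.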

I thought I might be able to solve this by excluding groups with too few rows:

na.omit(my_data) %>%
group_by(gender, status, country) %>%
filter(n() < 5) %>%
mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))

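But the same errors persist.

I am wondering: is there some way to modify this percentile function so that, when the percentiles cannot be calculated at the requested level, the next closest percentile level is calculated instead?

For example, if I want percentiles in 10 groups and this is not possible, perhaps percentiles in 15 or 20 groups would be possible?

As another example, suppose some group of observations (e.g. Male, Immigrant, Country A) has only 1 observation and I want percentiles in 10 groups. Naturally, this is not possible. Without knowing in advance that such a group exists, can the ptile function be modified so that it ignores this group or simply calculates the closest possible percentiles (e.g. places everything into 1 group)?

In general, how can I change this ptile function so that these errors are avoided?

Can anyone suggest a way to do this?

Thanks!

Note: I am also open to a different way of writing the function / solving this problem.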

If we need to bypass these errors, i.e. when the number of elements in a group is too small, one option is to use tryCatch or purrr::possibly:

library(dplyr)
library(purrr)
f_ptile <- possibly(ptile, otherwise = factor(NA_character_))
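For readers who prefer base R, the tryCatch route mentioned above could look roughly like this. It is only a sketch: f_ptile_tc is a hypothetical name, ptile is repeated from the question so the block runs on its own, and returning one NA per element is an assumption about the desired fallback.

```r
# ptile() as defined in the question
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# Hypothetical wrapper: on any error, return an all-NA factor of matching length
f_ptile_tc <- function(x, n_percentiles) {
  tryCatch(
    ptile(x, n_percentiles),
    error = function(e) factor(rep(NA_character_, length(x)))
  )
}

small_group <- f_ptile_tc(5, 10)      # a single observation: ptile() would fail
large_group <- f_ptile_tc(1:100, 4)   # enough distinct values: ptile() succeeds
print(small_group)
print(table(large_group))
```

Unlike the possibly() version, this wrapper returns a vector as long as its input, so it does not rely on mutate() recycling a length-one value.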

Testing:

na.omit(my_data) %>%
group_by(gender, status, country) %>%
slice(1) %>% # this fails with original ptile function
mutate(result1 = f_ptile(var1, 10), result2 = f_ptile(var2, 5))
# A tibble: 16 × 8
# Groups:   gender, status, country [16]
gender status    disease country    var1   var2 result1 result2
<fct>  <fct>     <fct>   <fct>     <dbl>  <dbl> <fct>   <fct>  
1 Female Citizen   No      A       11902.  -2436. <NA>    <NA>   
2 Female Citizen   Yes     B        3508.   6851. <NA>    <NA>   
3 Female Citizen   Yes     C       16854.  -1769. <NA>    <NA>   
4 Female Citizen   No      D       12372.   4363. <NA>    <NA>   
5 Female Immigrant Yes     A        9635.    695. <NA>    <NA>   
6 Female Immigrant No      B        3120.  -2674. <NA>    <NA>   
7 Female Immigrant No      C       -1635.   3971. <NA>    <NA>   
8 Female Immigrant No      D       -3163.   5802. <NA>    <NA>   
9 Male   Citizen   No      A        3836.   8196. <NA>    <NA>   
10 Male   Citizen   No      B        6125.   8096. <NA>    <NA>   
11 Male   Citizen   Yes     C        2159.   9863. <NA>    <NA>   
12 Male   Citizen   No      D        3622.   8159. <NA>    <NA>   
13 Male   Immigrant No      A       -3003.   6874. <NA>    <NA>   
14 Male   Immigrant Yes     B         -27.1  4063. <NA>    <NA>   
15 Male   Immigrant No      C        4166.   2103. <NA>    <NA>   
16 Male   Immigrant No      D        5718.  17866. <NA>    <NA>   
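Returning NA hides the small groups rather than binning them. If the goal is instead to fall back to as many bins as the data allow, one possible sketch is the following. ptile_safe is a hypothetical name and not part of the original answer; it assumes that dropping duplicated break points, and placing everything into one bin when only a single distinct break remains, is an acceptable fallback.

```r
# Hypothetical variant of ptile() that degrades to fewer bins instead of failing
ptile_safe <- function(x, n_percentiles) {
  # Drop duplicated break points so cut() never sees repeated breaks
  pct <- unique(quantile(x, probs = seq(0, 1, 1/n_percentiles)))
  n_bins <- length(pct) - 1
  if (n_bins < 1) {
    # Only one distinct break point: put every observation into a single bin
    label <- sprintf("%.2f to %.2f percentile 1", pct, pct)
    return(factor(rep(label, length(x))))
  }
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_bins))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

single <- ptile_safe(5, 10)       # one observation ends up in one bin
deciles <- ptile_safe(1:100, 10)  # enough distinct values for all 10 bins
print(single)
print(nlevels(deciles))
```

Note that after deduplication the bins no longer correspond to equal percentile steps, so the "percentile k" part of each label is just a running index in that case.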

This is not a real answer, but I am sure it will help.

At first I thought this was the cause:

The lengths of the breaks and the labels are not equal for var1 and var2:

pct_var1 <- quantile(my_data$var1, probs = seq(0, 1, 1/10))
length(pct_var1)
# Create a character vector to store the labels
labels_var1 <- sprintf("%.2f to %.2f percentile %d",
head(pct_var1, -1), tail(pct_var1, -1), seq_len(10))
length(labels_var1)
cut_var1 <- cut(my_data$var1, breaks = pct_var1, labels = labels_var1, include.lowest = TRUE)

pct_var2 <- quantile(my_data$var2, probs = seq(0, 1, 1/5))
length(pct_var2)
# Create a character vector to store the labels
labels_var2 <- sprintf("%.2f to %.2f percentile %d",
head(pct_var2, -1), tail(pct_var2, -1), seq_len(5))
length(labels_var2)
cut_var2 <- cut(my_data$var2, breaks = pct_var2, labels = labels_var2, include.lowest = TRUE)
> length(pct_var1)
[1] 11
> length(labels_var1)
[1] 10
> length(pct_var2)
[1] 6               
> length(labels_var2)
[1] 5

Together with this post, "Cut and labels/breaks length conflict", this should be solvable.
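It may help to see what cut() expects here. When breaks is a vector of break points, it defines length(breaks) - 1 intervals, so labels needs exactly one element fewer than breaks. By that rule the lengths 11/10 and 6/5 printed above are consistent on their own. A small sketch with made-up values:

```r
# Made-up values purely for illustration
x <- c(1, 2, 3, 4, 5, 6)
breaks <- c(1, 3, 6)        # 3 break points define 2 intervals: [1,3] and (3,6]
labels <- c("low", "high")  # so exactly 2 labels are needed

res <- cut(x, breaks = breaks, labels = labels, include.lowest = TRUE)
print(res)

# Supplying as many labels as breaks is rejected by cut()
bad <- tryCatch(
  cut(x, breaks = breaks, labels = c("low", "mid", "high")),
  error = function(e) "error"
)
print(bad)
```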

But then I came across your second data frame example:

na.omit(my_data) %>%
group_by(gender, status, country) %>%
filter(n() < 5)

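which removes all the data, because filter(n() < 5) keeps only groups with fewer than 5 rows and no group in this dataset is that small.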
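The group sizes can be inspected to confirm this. The sketch below rebuilds only the three grouping variables from the question with the same seed and counts rows per combination in base R; the variable names group_sizes, smallest and n_small_groups are introduced here for illustration.

```r
# Rebuild the grouping variables exactly as in the question
set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace = TRUE, prob = c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace = TRUE, prob = c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace = TRUE, prob = c(0.25, 0.25, 0.25, 0.25)))

# Number of rows in each gender x status x country combination (16 groups)
group_sizes <- table(gender, status, country)

smallest <- min(group_sizes)             # size of the smallest group
n_small_groups <- sum(group_sizes < 5)   # groups that filter(n() < 5) would keep

print(smallest)
print(n_small_groups)
```

If the intent was to drop small groups rather than keep them, the condition would presumably be n() >= 5 instead of n() < 5.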

Trying to solve this myself:

Attempt 1:

final_answer = my_data %>% group_by(gender, status, country) %>% 
mutate(result1 = case_when(ntile(var1, 10) == 1 ~ paste0(round(min(var1), 2), " to ", round(quantile(var1, 0.1), 2), " group 1"),
ntile(var1, 10) == 2 ~ paste0(round(quantile(var1, 0.1), 2), " to ", round(quantile(var1, 0.2), 2), " group 2"),
ntile(var1, 10) == 3 ~ paste0(round(quantile(var1, 0.2), 2), " to ", round(quantile(var1, 0.3), 2), " group 3"),
ntile(var1, 10) == 4 ~ paste0(round(quantile(var1, 0.3), 2), " to ", round(quantile(var1, 0.4), 2), " group 4"),
ntile(var1, 10) == 5 ~ paste0(round(quantile(var1, 0.4), 2), " to ", round(quantile(var1, 0.5), 2), " group 5"),
ntile(var1, 10) == 6 ~ paste0(round(quantile(var1, 0.5), 2), " to ", round(quantile(var1, 0.6), 2), " group 6"),
ntile(var1, 10) == 7 ~ paste0(round(quantile(var1, 0.6), 2), " to ", round(quantile(var1, 0.7), 2), " group 7"),
ntile(var1, 10) == 8 ~ paste0(round(quantile(var1, 0.7), 2), " to ", round(quantile(var1, 0.8), 2), " group 8"),
ntile(var1, 10) == 9 ~ paste0(round(quantile(var1, 0.8), 2), " to ", round(quantile(var1, 0.9), 2), " group 9"),
ntile(var1, 10) == 10 ~ paste0(round(quantile(var1, 0.9), 2), " to ", round(max(var1), 2), " group 10"))) %>% 
# result2 uses 5 groups, so ntile(var2, 5) with boundaries at 20% steps
mutate(result2 = case_when(ntile(var2, 5) == 1 ~ paste0(round(min(var2), 2), " to ", round(quantile(var2, 0.2), 2), " group 1"),
ntile(var2, 5) == 2 ~ paste0(round(quantile(var2, 0.2), 2), " to ", round(quantile(var2, 0.4), 2), " group 2"),
ntile(var2, 5) == 3 ~ paste0(round(quantile(var2, 0.4), 2), " to ", round(quantile(var2, 0.6), 2), " group 3"),
ntile(var2, 5) == 4 ~ paste0(round(quantile(var2, 0.6), 2), " to ", round(quantile(var2, 0.8), 2), " group 4"),
ntile(var2, 5) == 5 ~ paste0(round(quantile(var2, 0.8), 2), " to ", round(max(var2), 2), " group 5")))

Attempt 2:

percentile_classifier <- function(x, n_percentiles) {
  # Calculate the percentile boundaries (n_percentiles + 1 values)
  percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  # Create a character vector to store the labels
  labels <- character(length(x))
  # Loop over the bins; there is one bin fewer than there are boundaries
  for (i in seq_len(n_percentiles)) {
    lower <- percentiles[i]
    upper <- percentiles[i + 1]
    label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
    # The last bin is closed on the right so that the maximum also gets a label
    in_bin <- if (i == n_percentiles) x >= lower & x <= upper else x >= lower & x < upper
    labels[in_bin] <- label
  }
  # Return the labels
  return(labels)
}
final <- my_data %>% group_by(gender, status, country) %>% mutate(result1 = percentile_classifier(var1, 10))  %>% mutate(result2 = percentile_classifier(var2, 5))
