I am working with the R programming language.
I have the following dataset:
library(dplyr)
set.seed(123)
gender <- factor(sample(c("Male", "Female"), 5000, replace=TRUE, prob=c(0.45, 0.55)))
status <- factor(sample(c("Immigrant", "Citizen"), 5000, replace=TRUE, prob=c(0.3, 0.7)))
country <- factor(sample(c("A", "B", "C", "D"), 5000, replace=TRUE, prob=c(0.25, 0.25, 0.25, 0.25)))
disease <- factor(sample(c("Yes", "No"), 5000, replace=TRUE, prob=c(0.4, 0.6)))
my_data <- data.frame(gender, status, disease, country, var1 = rnorm(5000, 5000, 5000), var2 = rnorm(5000, 5000, 5000))
I then use this function to compute arbitrary percentiles of a variable:
# source: https://stackoverflow.com/questions/74947154/r-using-dplyr-to-perform-conditional-functions
ptile <- function(x, n_percentiles) {
# Calculate the percentiles
pct <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
# Create a character vector to store the labels
labels <- sprintf("%.2f to %.2f percentile %d",
head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}
Sometimes, when I use this function:
# error not produced on this dataset, but on other datasets
na.omit(my_data) %>%
group_by(gender, status, country) %>%
mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))
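I get one of two errors:
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : invalid number of intervals
Error in cut.default(x, breaks = pct, labels = labels, include.lowest = TRUE) : 'breaks' are not unique
At first, I thought these errors were being produced because I am applying the function to grouped rows, and some of those groups may have too few rows to compute the requested percentiles.
I thought I might be able to fix this by excluding groups that have too few rows: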
na.omit(my_data) %>%
group_by(gender, status, country) %>%
filter(n() < 5) %>%
mutate(result1 = ptile(var1, 10), result2 = ptile(var2, 5))
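But the same errors persist.
I am wondering: is there some way to modify this percentile function so that, when the percentiles cannot be computed at the requested level, the next closest feasible level is computed instead?
For example, if I want percentiles in 10 groups and this is not possible, maybe percentiles in 15 or 20 groups would be possible?
As another example, suppose some group of observations (e.g. Male, Immigrant, country A) has only 1 observation and I want percentiles in 10 groups; naturally, this seems impossible. Without knowing in advance that such a group exists, is it possible to modify the ptile function so that it either ignores this group or just computes the closest possible percentiles (e.g. puts everything into 1 group)?
In general, how can I change this ptile function so that these errors are fixed?
Can someone suggest a way to do this?
Thanks!
Note: I am also open to other ways of writing a function for this problem.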
If the goal is simply to bypass these errors when a group has too few elements, one option is to use tryCatch or purrr::possibly:
library(dplyr)
library(purrr)
f_ptile <- possibly(ptile, otherwise = factor(NA_character_))
Testing:
na.omit(my_data) %>%
group_by(gender, status, country) %>%
slice(1) %>% # this fails with original ptile function
mutate(result1 = f_ptile(var1, 10), result2 = f_ptile(var2, 5))
# A tibble: 16 × 8
# Groups: gender, status, country [16]
gender status disease country var1 var2 result1 result2
<fct> <fct> <fct> <fct> <dbl> <dbl> <fct> <fct>
1 Female Citizen No A 11902. -2436. <NA> <NA>
2 Female Citizen Yes B 3508. 6851. <NA> <NA>
3 Female Citizen Yes C 16854. -1769. <NA> <NA>
4 Female Citizen No D 12372. 4363. <NA> <NA>
5 Female Immigrant Yes A 9635. 695. <NA> <NA>
6 Female Immigrant No B 3120. -2674. <NA> <NA>
7 Female Immigrant No C -1635. 3971. <NA> <NA>
8 Female Immigrant No D -3163. 5802. <NA> <NA>
9 Male Citizen No A 3836. 8196. <NA> <NA>
10 Male Citizen No B 6125. 8096. <NA> <NA>
11 Male Citizen Yes C 2159. 9863. <NA> <NA>
12 Male Citizen No D 3622. 8159. <NA> <NA>
13 Male Immigrant No A -3003. 6874. <NA> <NA>
14 Male Immigrant Yes B -27.1 4063. <NA> <NA>
15 Male Immigrant No C 4166. 2103. <NA> <NA>
16 Male Immigrant No D 5718. 17866. <NA> <NA>
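The tryCatch route mentioned above could look roughly like the sketch below. safe_ptile is a hypothetical name, and ptile is copied from the question only so that the block runs on its own. Unlike the possibly version, this fallback returns an NA factor with the same length as the input, which may or may not be what you want.

```r
# ptile() copied from the question so this block is self-contained
ptile <- function(x, n_percentiles) {
  pct <- quantile(x, probs = seq(0, 1, 1 / n_percentiles))
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_percentiles))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

# safe_ptile is a hypothetical wrapper (not from the original posts):
# on any error it returns an all-NA factor as long as x
safe_ptile <- function(x, n_percentiles) {
  tryCatch(
    ptile(x, n_percentiles),
    error = function(e) factor(rep(NA_character_, length(x)))
  )
}

safe_ptile(5, 10)             # single value: NA instead of an error
safe_ptile(c(1, 2, 3, 4), 2)  # enough distinct values: regular labels
```

Inside mutate() it would be used the same way as f_ptile above, e.g. result1 = safe_ptile(var1, 10).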
This is not a real answer, but I am sure it will help.
First, I think this is the cause:
the breaks and the labels have different lengths for var1 and var2:
pct_var1 <- quantile(my_data$var1, probs = seq(0, 1, 1/10))
length(pct_var1)
# Create a character vector to store the labels
labels_var1 <- sprintf("%.2f to %.2f percentile %d",
                       head(pct_var1, -1), tail(pct_var1, -1), seq_len(10))
length(labels_var1)
cut_var1 <- cut(my_data$var1, breaks = pct_var1, labels = labels_var1, include.lowest = TRUE)
pct_var2 <- quantile(my_data$var2, probs = seq(0, 1, 1/5))
length(pct_var2)
# Create a character vector to store the labels
labels_var2 <- sprintf("%.2f to %.2f percentile %d",
                       head(pct_var2, -1), tail(pct_var2, -1), seq_len(5))
length(labels_var2)
cut_var2 <- cut(my_data$var2, breaks = pct_var2, labels = labels_var2, include.lowest = TRUE)
> length(pct_var1)
[1] 11
> length(labels_var1)
[1] 10
> length(pct_var2)
[1] 6
> length(labels_var2)
[1] 5
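For reference, cut() expects exactly one fewer label than break points, so 11 breaks with 10 labels (or 6 with 5) is the normal case rather than a mismatch. The small sketch below uses made-up toy values (x, pct_one and msg are illustrative names, not from the question) and suggests the error is more likely to come from tied break points in very small groups:

```r
# Five distinct values: 6 break points and 5 labels work together
x <- c(10, 20, 30, 40, 50)
pct <- quantile(x, probs = seq(0, 1, 1 / 5))
labels <- paste("bin", seq_len(5))
res <- cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
length(pct)     # one more than the number of labels
length(labels)
res

# A single observation: every quantile equals that observation,
# so all break points are tied and cut() refuses them
pct_one <- quantile(42, probs = seq(0, 1, 1 / 5))
msg <- tryCatch(
  {
    cut(42, breaks = pct_one, include.lowest = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg  # in an English locale this should be the "not unique" message
```

The same thing can happen in larger groups whenever many values are tied, because several quantiles then coincide.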
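Combined with this post, the length conflict between cut and its labels/breaks should be solvable.
But then I came across your second data frame example: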
na.omit(my_data) %>%
group_by(gender, status, country) %>%
filter(n() < 5)
This removes all of the data.
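This happens because filter(n() < 5) keeps only the groups with fewer than 5 rows, and with 5000 simulated rows spread over 16 groups no group is that small. To exclude small groups the condition would need to point the other way. A minimal sketch on a made-up toy data frame (toy, small_only and large_only are illustrative names, and the threshold of 3 is arbitrary):

```r
library(dplyr)

# Toy data: group "a" has 3 rows, group "b" has 1 row
toy <- data.frame(g = c("a", "a", "a", "b"), v = c(1, 2, 3, 4))

# n() < 3 keeps ONLY the small groups
small_only <- toy %>%
  group_by(g) %>%
  filter(n() < 3) %>%
  ungroup()

# n() >= 3 drops the small groups and keeps the large ones
large_only <- toy %>%
  group_by(g) %>%
  filter(n() >= 3) %>%
  ungroup()

small_only
large_only
```

Note that filtering on group size alone may not be enough, since a large group with many tied values can still produce duplicated break points.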
My own attempts at solving this problem:
Attempt 1:
final_answer = my_data %>% group_by(gender, status, country) %>%
mutate(result1 = case_when(ntile(var1, 10) == 1 ~ paste0(round(min(var1), 2), " to ", round(quantile(var1, 0.1), 2), " group 1"),
ntile(var1, 10) == 2 ~ paste0(round(quantile(var1, 0.1), 2), " to ", round(quantile(var1, 0.2), 2), " group 2"),
ntile(var1, 10) == 3 ~ paste0(round(quantile(var1, 0.2), 2), " to ", round(quantile(var1, 0.3), 2), " group 3"),
ntile(var1, 10) == 4 ~ paste0(round(quantile(var1, 0.3), 2), " to ", round(quantile(var1, 0.4), 2), " group 4"),
ntile(var1, 10) == 5 ~ paste0(round(quantile(var1, 0.4), 2), " to ", round(quantile(var1, 0.5), 2), " group 5"),
ntile(var1, 10) == 6 ~ paste0(round(quantile(var1, 0.5), 2), " to ", round(quantile(var1, 0.6), 2), " group 6"),
ntile(var1, 10) == 7 ~ paste0(round(quantile(var1, 0.6), 2), " to ", round(quantile(var1, 0.7), 2), " group 7"),
ntile(var1, 10) == 8 ~ paste0(round(quantile(var1, 0.7), 2), " to ", round(quantile(var1, 0.8), 2), " group 8"),
ntile(var1, 10) == 9 ~ paste0(round(quantile(var1, 0.8), 2), " to ", round(quantile(var1, 0.9), 2), " group 9"),
ntile(var1, 10) == 10 ~ paste0(round(quantile(var1, 0.9), 2), " to ", round(max(var1), 2), " group 10"))) %>%
mutate(result2 = case_when(ntile(var2, 5) == 1 ~ paste0(round(min(var2), 2), " to ", round(quantile(var2, 0.2), 2), " group 1"),
ntile(var2, 5) == 2 ~ paste0(round(quantile(var2, 0.2), 2), " to ", round(quantile(var2, 0.4), 2), " group 2"),
ntile(var2, 5) == 3 ~ paste0(round(quantile(var2, 0.4), 2), " to ", round(quantile(var2, 0.6), 2), " group 3"),
ntile(var2, 5) == 4 ~ paste0(round(quantile(var2, 0.6), 2), " to ", round(quantile(var2, 0.8), 2), " group 4"),
ntile(var2, 5) == 5 ~ paste0(round(quantile(var2, 0.8), 2), " to ", round(max(var2), 2), " group 5")))
Attempt 2:
percentile_classifier <- function(x, n_percentiles) {
  # Calculate the percentile break points (n_percentiles + 1 values)
  percentiles <- quantile(x, probs = seq(0, 1, 1/n_percentiles))
  # Create a character vector to store the labels
  labels <- character(length(x))
  n_bins <- length(percentiles) - 1
  # Loop through each bin and assign its label to the elements that fall inside it
  for (i in seq_len(n_bins)) {
    lower <- percentiles[i]
    upper <- percentiles[i + 1]
    label <- paste0(round(lower, 2), " to ", round(upper, 2), " percentile ", i)
    # The last bin is closed on the right so that the maximum also gets a label
    in_bin <- if (i == n_bins) x >= lower & x <= upper else x >= lower & x < upper
    labels[in_bin] <- label
  }
  # Return the labels
  return(labels)
}
final <- my_data %>% group_by(gender, status, country) %>% mutate(result1 = percentile_classifier(var1, 10)) %>% mutate(result2 = percentile_classifier(var2, 5))
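Another possible direction, closer to what the question asks for, is to keep cut() but drop tied break points and fall back to a single bin when no interval can be formed. This is only a sketch under those assumptions: robust_ptile and the "all values in 1 group" label are hypothetical, not part of the original function.

```r
# robust_ptile is a hypothetical variant of ptile (not from the original posts)
robust_ptile <- function(x, n_percentiles) {
  # Compute the requested break points, then drop exact ties
  pct <- unique(quantile(x, probs = seq(0, 1, 1 / n_percentiles), na.rm = TRUE))
  if (length(pct) < 2) {
    # Fewer than two distinct break points: put everything into one bin
    return(factor(ifelse(is.na(x), NA, "all values in 1 group")))
  }
  # Label whatever bins remain after removing ties
  n_bins <- length(pct) - 1
  labels <- sprintf("%.2f to %.2f percentile %d",
                    head(pct, -1), tail(pct, -1), seq_len(n_bins))
  cut(x, breaks = pct, labels = labels, include.lowest = TRUE)
}

robust_ptile(5, 10)              # one observation: single fallback bin
robust_ptile(c(1, 1, 1, 2), 10)  # ties: fewer than 10 bins, no error
robust_ptile(1:100, 10)          # enough distinct values: 10 bins
```

One trade-off to be aware of: once ties are removed, the bin number in a label no longer corresponds to the originally requested percentile, so labels are only comparable within a group.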