R - Is there a simple way to recode the levels of a factor variable so that levels below a given frequency are recoded as "other"?



thresholds <- c(0.001, 0.5, 0.1)

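df$a df$b df$c - recode levels whose frequency falls below the first threshold

df$x df$y df$x - recode levels whose frequency falls below the second threshold

df$d df$e df$f - recode levels whose frequency falls below the third threshold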

You are looking for fct_lump_prop() from forcats:

library(forcats)
library(dplyr)
dat <- data.frame(base = c("A", "A", "A",
                           "B", "B",
                           "C",
                           "D"))
dat |> mutate(base0.2 = fct_lump_prop(base, 0.2),
              base0.3 = fct_lump_prop(base, 0.3))

Output

#>   base base0.2 base0.3
#> 1    A       A       A
#> 2    A       A       A
#> 3    A       A       A
#> 4    B       B   Other
#> 5    B       B   Other
#> 6    C   Other   Other
#> 7    D   Other   Other

Created on 2022-03-31 by the reprex package (v2.0.0)
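
To see why B survives the 0.2 cutoff but not the 0.3 cutoff, it can help to look at the level proportions directly. The following is only a base R sketch that recomputes the shares for the same toy data; it does not call forcats, and the `props` name is introduced here for illustration.

```r
# Same toy data as in the answer above
dat <- data.frame(base = c("A", "A", "A", "B", "B", "C", "D"))

# Share of each level: A = 3/7, B = 2/7, C = 1/7, D = 1/7
props <- prop.table(table(dat$base))
round(props, 3)

# Levels whose share falls below each cutoff
names(props)[props < 0.2]  # "C" "D"
names(props)[props < 0.3]  # "B" "C" "D"
```

B has a share of 2/7 (about 0.286), which is above 0.2 and below 0.3, matching the base0.2 and base0.3 columns in the output.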

There may be a simpler tidy way to do this, but you can write a small function to do it:

set.seed(519)
x <- sample(LETTERS[1:5], 1000, prob=c(.01,.1,.29,.3,.3), replace=TRUE)
x <- as.factor(x)
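recode_thresh <- function(x, threshold = .15){
  # Proportion of observations in each level
  tab <- table(x)/sum(table(x))
  levs <- levels(x)
  levs <- c(levs, "other")
  x <- as.character(x)
  # Replace every level below the threshold with "other"
  if(any(tab < threshold)){
    x <- ifelse(x %in% names(tab)[which(tab < threshold)], "other", x)
  }
  # Keep only the levels that still occur, in their original order
  levs <- intersect(levs, unique(x))
  factor(x, levels=levs)
}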
x2 <- recode_thresh(x, threshold=.15)
table(x)/1000
#> x
#>     A     B     C     D     E 
#> 0.014 0.106 0.294 0.276 0.310
table(x2)/1000
#> x2
#>     C     D     E other 
#> 0.294 0.276 0.310 0.120

Created on 2022-03-31 by the reprex package (v2.0.1)
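
The same idea can be written more compactly in base R by assigning to levels(), which merges every level given the same new name. This is only a sketch: `lump_rare` and its `other` argument are hypothetical names, not part of the answer above or of any package. One difference from recode_thresh is that the merged level takes the position of the first rare level instead of always being moved to the end.

```r
# Hypothetical helper: merge all levels whose share is below `threshold`
lump_rare <- function(x, threshold = 0.15, other = "other") {
  x <- as.factor(x)
  props <- prop.table(table(x))
  rare <- names(props)[props < threshold]
  # Renaming several levels to the same label collapses them into one
  levels(x)[levels(x) %in% rare] <- other
  x
}

x <- factor(c("A", "A", "A", "B", "B", "C", "D"))
levels(lump_rare(x, threshold = 0.2))  # "A" "B" "other"
levels(lump_rare(x, threshold = 0.3))  # "A" "other"
```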

Based on Andreas's suggestion and some further reading, I came up with the approach below, which works well. Thanks.

agg_cats_thresholds <- c(0.01, 0.05, 0.005, 0.001)
agg_cats_thresholds <- as.data.frame(agg_cats_thresholds)
# Create the lists of variables
factor_columns1 <- c("a", "b", "c", "d", "e")
factor_columns2 <- c("f")
factor_columns3 <- c("g")
factor_columns4 <- c("h", "i", "j", "k")
# Use fct_lump_prop to reduce the levels of the various factor variables
churn.ml[factor_columns1] <- lapply(churn.ml[factor_columns1],
                                    fct_lump_prop,
                                    prop = agg_cats_thresholds[1, ],
                                    other_level = 'other')
churn.ml[factor_columns2] <- lapply(churn.ml[factor_columns2],
                                    fct_lump_prop,
                                    prop = agg_cats_thresholds[2, ],
                                    other_level = 'other')
churn.ml[factor_columns3] <- lapply(churn.ml[factor_columns3],
                                    fct_lump_prop,
                                    prop = agg_cats_thresholds[3, ],
                                    other_level = 'other')
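
The repeated lapply() calls could also be driven by a single list that pairs each group of columns with its threshold. The sketch below assumes forcats is installed; the `toy` data frame, the `groups` list, and the thresholds in it are made up for illustration and stand in for churn.ml and its column groups.

```r
library(forcats)

# Hypothetical toy data standing in for churn.ml
toy <- data.frame(
  a = factor(rep(c("p", "q", "r", "s"), times = c(60, 30, 6, 4))),
  b = factor(rep(c("k", "l", "m"),      times = c(90, 5, 5))),
  f = factor(rep(c("u", "v", "w", "x"), times = c(50, 30, 10, 10)))
)

# Each entry pairs a set of columns with the threshold to apply to them
groups <- list(
  list(cols = c("a", "b"), prop = 0.08),
  list(cols = "f",         prop = 0.15)
)

for (g in groups) {
  toy[g$cols] <- lapply(toy[g$cols], fct_lump_prop,
                        prop = g$prop, other_level = "other")
}

lapply(toy, levels)
```

Keeping columns and thresholds together in one structure means adding another group only requires a new list entry, not another copy of the lapply() call.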
