R - Create a 70%-30% ratio between my classes



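I have imbalanced data, say 2% bad and 98% good.

What I want to do now is repeat the bad class until I reach (for example) a ratio of 70% bad to 30% good.

I know this is a rather unusual approach (I have already tried SMOTE), but I am simply curious about the result.

I will use the resulting data to fit a decision tree.

Example data: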

> df
class   percentage     color  
bad        0.45        green
bad        0.67        red
bad        0.34        blue
good       0.22        black
good       0.25        pink
good       0.89        green
good       0.76        yellow
good       0.35        grey
good       0.44        red
good       0.99        red
good       0.12        blue
good       0.56        black
good       0.70        pink
good       0.49        yellow

The desired output would be:

> df
class   percentage     color  
bad        0.45        green
bad        0.67        red
bad        0.34        blue
bad        0.45        green
bad        0.67        red
bad        0.34        blue    
bad        0.67        red
bad        0.34        blue
bad        0.45        green
bad        0.45        green
bad        0.67        red
bad        0.34        blue
bad        0.45        green
bad        0.67        red
bad        0.34        blue
bad        0.45        green
bad        0.67        red
bad        0.34        blue
bad        0.45        green
bad        0.67        red
bad        0.34        blue
bad        0.45        green
bad        0.67        red
bad        0.34        blue
good       0.22        black
good       0.25        pink
good       0.89        green
good       0.76        yellow
good       0.35        grey
good       0.44        red
good       0.99        red
good       0.12        blue
good       0.56        black
good       0.70        pink
good       0.49        yellow

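First of all, I'd say you should avoid this, because you may end up with a sample that no longer represents the real data.

In effect you are just copying the same 3 cases over and over. SMOTE should be a better way to rebalance the classes.

Anyway, here is one way to do it. The key step is this line, which stacks n_bad copies of the bad cases:

do.call("rbind", replicate(n_bad, d_bad, simplify = FALSE))

The full code: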

library(dplyr)
# parameters you can play with
n_rows_final <- 100   # approximate size of the final data set
perc_bad <- 0.7       # desired share of bad cases
# split the original data by class
d_bad <- d %>% filter(class == "bad")
d_good <- d %>% filter(class == "good")
# number of whole copies of the bad rows (23.33 is rounded to 23)
n_bad <- round((n_rows_final * perc_bad) / nrow(d_bad))
# number of good rows to keep
n_good <- round(n_rows_final * (1 - perc_bad))
set.seed(123)
# draw n_good good rows by index; sample with replacement only when
# there are fewer than n_good good rows (as in the 11-row example),
# otherwise indexing past the last row would produce NA rows
d_good <- d_good[sample(nrow(d_good), n_good, replace = n_good > nrow(d_good)), ]
# stack n_bad copies of the bad rows
d_bad <- do.call("rbind", replicate(n_bad, d_bad, simplify = FALSE))
d_final <- rbind(d_bad, d_good) # bind both classes together
table(d_final$class)
# bad good 
#  69   30
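
Because replicate() can only stack whole copies of the 3 bad rows, the bad count above lands on 69 rather than 70. If the exact split matters, one option is to repeat row indices instead of whole data frames. The following is only a sketch under assumptions: the data frame is rebuilt inline from the question's example, the target size of 100 is taken from the answer, and the names n_bad_target, n_good_target and d_exact are made up for illustration.

```r
# Example data from the question (3 bad rows, 11 good rows)
d <- data.frame(
  class = rep(c("bad", "good"), times = c(3, 11)),
  percentage = c(0.45, 0.67, 0.34, 0.22, 0.25, 0.89, 0.76,
                 0.35, 0.44, 0.99, 0.12, 0.56, 0.70, 0.49),
  color = c("green", "red", "blue", "black", "pink", "green", "yellow",
            "grey", "red", "red", "blue", "black", "pink", "yellow")
)

n_rows_final <- 100  # assumed target size, as in the answer above
perc_bad <- 0.7      # desired share of bad rows

n_bad_target <- round(n_rows_final * perc_bad)   # 70
n_good_target <- n_rows_final - n_bad_target     # 30

d_bad <- d[d$class == "bad", ]
d_good <- d[d$class == "good", ]

# rep_len() cycles through 1, 2, 3, 1, 2, 3, ... until it has exactly
# n_bad_target indices, so every bad row is repeated 23 or 24 times
bad_idx <- rep_len(seq_len(nrow(d_bad)), n_bad_target)
# the toy data has only 11 good rows, so they are cycled as well
good_idx <- rep_len(seq_len(nrow(d_good)), n_good_target)

d_exact <- rbind(d_bad[bad_idx, ], d_good[good_idx, ])
table(d_exact$class)
```

This gives exactly 70 bad and 30 good rows. With real data that has plenty of good rows, you would probably draw the good rows with sample() without replacement instead of cycling them.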

Data:

tt <- "class   percentage     color  
bad        0.45        green
bad        0.67        red
bad        0.34        blue
good       0.22        black
good       0.25        pink
good       0.89        green
good       0.76        yellow
good       0.35        grey
good       0.44        red
good       0.99        red
good       0.12        blue
good       0.56        black
good       0.70        pink
good       0.49        yellow"
d <- read.table(text=tt, header=T)

Not sure whether this is the most efficient way, but it should work:

library(dplyr)
class <- c("bad","bad","bad","good","good","good","good","good","good","good","good")
val <- rnorm(length(class))
df <- data.frame(class, val)
# number of extra bad rows needed for a 70:30 bad:good ratio
n <- round(sum(df$class == "good") * (7/3)) - sum(df$class == "bad")
# bad rows to sample from
bad.df <- df %>% filter(class == "bad")
# draw n row indices with replacement and pick those rows
s <- sample(seq_len(nrow(bad.df)), n, replace = TRUE)
bad.df <- bad.df[s, ]
# bind the extra bad rows to the original df
df2 <- bind_rows(df, bad.df)
prop.table(table(df2$class))
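
The 7/3 factor in the code above comes from solving b / (b + g) = p for b, which gives b = g * p / (1 - p), where g is the number of good rows and p the desired share of bad rows. A small worked check with the toy numbers from this answer (8 good, 3 bad); the variable names here are only illustrative:

```r
p_bad  <- 0.7  # desired share of bad rows
n_good <- 8    # good rows in the toy data of this answer
n_bad  <- 3    # bad rows in the toy data of this answer

# total number of bad rows needed: g * p / (1 - p) = 8 * 7/3 = 18.67, rounded to 19
target_bad <- round(n_good * p_bad / (1 - p_bad))

# extra bad rows that have to be sampled on top of the existing ones: 19 - 3 = 16
n_extra <- target_bad - n_bad

# resulting share of bad rows: 19 / 27, roughly 0.704
share_bad <- target_bad / (target_bad + n_good)
share_bad
```

Since rows come in whole numbers, the share only hits 0.7 exactly when g * p / (1 - p) is an integer; here it ends up slightly above.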

You can try:

library(tidyverse)
set.seed(134)
d %>%
  group_by(class) %>% 
  sample_n(size = 100, replace = T) %>% 
  split(.$class) %>% 
  map2(.,c(0.3, 0.7), ~mutate(.x, gr=sample(c(TRUE, FALSE), size = n(), replace = T, prob = c(1-.y, .y)))) %>% 
  bind_rows() %>% 
  ungroup() %>% 
  filter(gr) %>% 
  select(-gr)
# A tibble: 101 x 3
   class percentage color
   <fct>      <dbl> <fct>
 1 bad         0.45 green
 2 bad         0.34 blue 
 3 bad         0.34 blue 
 4 bad         0.67 red  
 5 bad         0.67 red  
 6 bad         0.34 blue 
 7 bad         0.45 green
 8 bad         0.34 blue 
 9 bad         0.67 red  
10 bad         0.34 blue 
# ... with 91 more rows
.Last.value %>% 
  count(class)
# A tibble: 2 x 2
  class     n
  <fct> <int>
1 bad      71
2 good     28

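The idea is to sample both groups to the same size (100 here, but this can be increased). Then a filter variable gr is added that is TRUE with probability 0.7 for the bad rows and 0.3 for the good rows, and only the rows where gr is TRUE are kept.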
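
Note that with this approach the 70:30 split holds only on average, because every row is kept or dropped at random. A rough way to see how much the share of bad rows can vary is to simulate the kept counts directly. This is a sketch that assumes 100 rows per group as above; the number of simulated runs (1000) and the seed are arbitrary choices:

```r
set.seed(1)         # arbitrary seed, only for reproducibility
n_per_group <- 100  # rows sampled per class, as in the answer above
n_runs <- 1000      # number of simulated runs (assumption for this sketch)

# each bad row is kept with probability 0.7, each good row with 0.3,
# so the kept counts per run follow binomial distributions
kept_bad <- rbinom(n_runs, n_per_group, 0.7)
kept_good <- rbinom(n_runs, n_per_group, 0.3)

share_bad <- kept_bad / (kept_bad + kept_good)
mean(share_bad)   # close to 0.7 on average
range(share_bad)  # individual runs scatter around that value
```

Increasing the per-group sample size narrows that spread, which is one reason to use more than 100 rows per group.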
