Combining scenarios to replace sales by the median within groups in R



I have a dataset:

mydat <- 
structure(list(code = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("52382MCK", 
"52499MCK"), class = "factor"), item = c(11709L, 11709L, 11709L, 
11709L, 11708L, 11708L, 11708L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 11710L, 
11710L, 11202L, 11203L, 11203L, 11204L, 11204L, 11205L, 11205L
), sales = c(30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 20L, 
15L, 2L, 10L, 3L, 30L, 10L, 20L, 15L, 2L, 10L, 3L, 30L, 10L, 
20L, 15L, 2L, 10L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), action = c(0L, 
1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 
1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 
1L, 1L)), row.names = c(NA, -35L), class = "data.frame")
library(data.table)
# coerce to data.table
setDT(mydat)

Several operations are performed on this dataset.

1. Selecting the scenario by group.

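There is an action column. It can take only two values: zero (0) or one (1).

A scenario is defined by the number of zero-category action rows before the one-category action rows and the number of zero-category rows after them.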

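For example:
52382MCK    11709

One possible scenario is 1 zero-category row before the one-category action rows and two zeros after them. Note: another possible scenario is 2 zero-category rows before the one-category action rows and 1 zero after them.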

mydat1
code        item    sales   action
52382MCK    11709   30      0
52382MCK    11709   10      1
52382MCK    11709   20      0
52382MCK    11709   15      0
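
To make the scenario idea concrete, here is a minimal base R sketch that checks which patterns match the action sequence of the group shown above. The variable names are illustrative only; the patterns follow the "%s1+(?=%s)" template used in the script further down.

```r
# action values of group 52382MCK / 11709, copied from mydat
action <- c(0L, 1L, 0L, 0L)
action_string <- paste(action, collapse = "")  # "0100"

# scenario "1-2": one zero before the run of ones, two zeros after (lookahead)
pattern_1_2 <- "01+(?=00)"
# scenario "2-1": two zeros before the run of ones, one zero after
pattern_2_1 <- "001+(?=0)"
# scenario "1-1": a shorter pattern that is contained in "1-2"
pattern_1_1 <- "01+(?=0)"

# perl = TRUE is required for the lookahead (?=...)
matches_1_2 <- grepl(pattern_1_2, action_string, perl = TRUE)
matches_2_1 <- grepl(pattern_2_1, action_string, perl = TRUE)
matches_1_1 <- grepl(pattern_1_1, action_string, perl = TRUE)
```

For this group only "1-2" and the shorter "1-1" match, while "2-1" does not. Since shorter patterns match as well, this is presumably why the script picks the last matching scenario with max().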

To detect the scenario, I use this script (it works well, thanks @Uwe):

library(data.table)
library(magrittr)
max_zeros <- 3
zeros <- sapply(0:max_zeros, stringr::str_dup, string = "0")
names(zeros) <- as.character(nchar(zeros))
sc <- CJ(zeros.before = zeros, zeros.after = zeros)[
, scenario.name := paste(nchar(zeros.before), nchar(zeros.after), sep = "-")][
, action.pattern := sprintf("%s1+(?=%s)", zeros.before, zeros.after)][]
# special case: all zero
sc0 <- data.table(
zeros.before = NA,
zeros.after = NA, 
scenario.name = "no1", 
action.pattern = "^0+$")
sc <- rbind(sc0, sc)

Then:

setDT(mydat)
class <- mydat[, .(scenario.name = sc$scenario.name[
paste(action, collapse = "") %>% 
stringr::str_count(sc$action.pattern) %>%
is_greater_than(0) %>% 
which() %>% 
max()
]),
by = .(code, item)][]
class
mydat[class, on = .(code, item)]

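So I get the data together with the scenario class.

2. The second operation is replacement by the median.

For each scenario, the median is computed over the zero-category rows.

For example, I need the median of the 1 zero-category row that precedes the one-category action rows and of the 2 zero-category rows that follow them. The replacement by the median is applied only to the one-category action rows, in the sales column. If the median is greater than sales, sales is not replaced.

For this I use the script: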

sales_action <- function(DF, zeros_before, zeros_after) {
library(data.table)
library(magrittr)
action_pattern <- 
do.call(sprintf, 
c(fmt = "%s1+(?=%s)", 
stringr::str_dup("0", c(zeros_before, zeros_after)) %>% as.list()
))
message("Action pattern used: ", action_pattern)
setDT(DF)[, rn := .I]
tmp <- DF[, paste(action, collapse = "") %>% 
stringr::str_locate_all(action_pattern) %>% 
as.data.table() %>% 
lapply(function(x) rn[x]),
by = .(code, item)][
, end := end + zeros_after]
DF[tmp, on = .(code, item, rn >= start, rn <= end), 
med := as.double(median(sales[action == 0])), by = .EACHI][
, output := as.double(sales)][action == 1, output := pmin(sales, med)][
, c("rn", "med") := NULL][]
}

Then:

sales_action(mydat, 1L, 2L)

So I get the result.

The problem is the following:

Each time I have to enter the scenario manually in order to replace by the median:

sales_action(mydat, 1L, 2L)
sales_action(mydat, 3L, 1L)
sales_action(mydat, 2L, 2L)

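and so on.

How can I perform the replacement by the median automatically for all possible scenarios, so that I do not have to write sales_action(mydat, .L, .L) each time?

An example of the desired output:

code        item    sales   action  output  pattern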
52382MCK    11709   30        0       30    01+00
52382MCK    11709   10        1       10    01+00
52382MCK    11709   20        0       20    01+00
52382MCK    11709   15        0       15    01+00
52382MCK    1170    8         0        8    01+00
52382MCK    1170    10        1        8    01+00
52382MCK    1170    2         0        2    01+00
52382MCK    1170    15        0        15   01+00

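If I understand correctly, the OP wants to analyze the success of sales actions by comparing the sales figures during an action with the median sales of the periods immediately before and after it.

There are some challenges:

  1. Each code, item group may contain more than one sales action.
  2. The available data may cover fewer than the requested three days before or after a sales action.

IMHO, the introduction of scenarios is a detour for handling issue 2.

The approach below

  • identifies the sales actions within each code, item group,
  • picks up to three zero-action rows before and up to three after each sales action,
  • computes the median of the sales in those rows, and
  • updates output in case the sales figure within a sales action exceeds the median of the surrounding zero-action rows.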
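
As a worked illustration of these steps, here is a minimal base R sketch for a single group, 52382MCK / 11708, with values copied from mydat. It assumes exactly one action row surrounded only by zero-action rows, so it is not a general implementation; the variable names are made up for the example.

```r
# sales and action of group 52382MCK / 11708, copied from mydat
sales  <- c(2L, 10L, 3L)
action <- c(0L, 1L, 0L)

# position of the single sales action (assumption: exactly one action row)
action_row <- which(action == 1L)

# up to three rows before and up to three rows after the action
before <- tail(sales[seq_len(action_row - 1L)], 3)
after  <- head(sales[-seq_len(action_row)], 3)

# median of the surrounding zero-action sales
med <- median(c(before, after))

# replace the action's sales only if it exceeds the median
output <- as.double(sales)
output[action_row] <- min(sales[action_row], med)
```

Here the surrounding sales are 2 and 3, so the median is 2.5 and the action's sales of 10 is lowered to 2.5.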

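The term category was coined by the OP to distinguish periods of sales actions (contiguous streaks of action == 1L) from the zero-action periods before and after them.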
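
To illustrate what numbering the categories means, here is a small base R sketch that assigns a run id to each streak of equal action values. The action vector is hypothetical, and the rle() construction is only a stand-in for a run-length id function such as data.table's rleid().

```r
# hypothetical action sequence of one code, item group
action <- c(0L, 0L, 0L, 1L, 0L, 0L, 1L, 1L, 0L)

# run-length encoding: one entry per streak of equal values
runs <- rle(action)

# category id: consecutive number of the streak each row belongs to
cat_id <- rep(seq_along(runs$lengths), runs$lengths)

# categories that are sales actions (streaks of ones)
action_cats <- which(runs$values == 1L)
```

Every change in action starts a new category, so the zero-action categories directly before and after an action category have ids cat - 1 and cat + 1.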

library(data.table)
# coerce to data.table and create categories
setDT(mydat)[, cat := rleid(action), by = .(code, item)][]
# extract action categories, identify preceding & succeeding zero action categories
mycat <- mydat[, .(action = first(action)), by = .(code, item, cat = cat)][
, `:=`(before = cat - 1L, after = cat + 1L)][action == 1L]
mycat
code  item cat action before after
1: 52382MCK 11709   2      1      1     3
2: 52382MCK 11708   2      1      1     3
3: 52382MCK 11710   2      1      1     3
4: 52382MCK 11710   4      1      3     5
5: 52382MCK 11710   6      1      5     7
6: 52499MCK 11203   2      1      1     3
7: 52499MCK 11205   1      1      0     2

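Note that the group 52382MCK, 11710 includes three separate sales actions. before and after may point to a non-existent cat, but this is rectified automatically during the subsequent joins.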
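
As a small illustration of why a non-existent cat does no harm, here is a sketch with a hypothetical miniature table (mini and wanted are made-up names and values). A join on a category that does not exist returns a row with NA sales, and that row can be filtered out afterwards.

```r
library(data.table)

# hypothetical group: cat 1 is an action category with nothing before it
mini <- data.table(
  cat    = c(1L, 1L, 2L, 2L),
  sales  = c(5L, 7L, 9L, 4L),
  action = c(1L, 1L, 0L, 0L)
)

# categories to look up: before = 0 (does not exist), after = 2 (exists)
wanted <- data.table(cat = c(0L, 2L))

# the default nomatch = NA keeps the unmatched lookup row, filled with NA
joined <- mini[wanted, on = .(cat)]

# dropping the NA rows removes the non-existent category
kept <- joined[!is.na(sales)]
```

This is consistent with group 52499MCK, 11205 in the data, which has no zero-action rows at all and therefore no median.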

# compute median of surrounding zero action categories
action_cat_median <- 
rbind(
# get sales from up to 3 zero action rows before action category
mydat[mycat, on = .(code, item, cat = before), 
.(sales = tail(sales, 3), i.cat), by =.EACHI],
# get sales from up to 3 zero action rows after action category
mydat[mycat, on = .(code, item, cat = after), 
.(sales = head(sales, 3), i.cat), by =.EACHI]
)[
# remove empty groups
!is.na(sales)][
# compute median for each action category
, .(med = as.double(median(sales))), by = .(code, item, cat = i.cat)]
action_cat_median
code  item cat  med
1: 52382MCK 11709   2 20.0
2: 52382MCK 11708   2  2.5
3: 52382MCK 11710   2 10.0
4: 52382MCK 11710   4 10.0
5: 52382MCK 11710   6 10.0
6: 52499MCK 11203   2  2.0
# prepare result
mydat[, output := as.double(sales)][
# update join
action_cat_median, on = .(code, item, cat), output := pmin(sales, med)]

Edit: Alternatively, the call to pmin() can be replaced by a non-equi join which updates only those rows where sales exceed the median:

# prepare result, alternative approach
mydat[, output := as.double(sales)][
# non-equi update join
action_cat_median, on = .(code, item, cat, output > med), output := med]

mydat
code  item sales action cat output
1: 52382MCK 11709    30      0   1   30.0
2: 52382MCK 11709    10      1   2   10.0
3: 52382MCK 11709    20      0   3   20.0
4: 52382MCK 11709    15      0   3   15.0
5: 52382MCK 11708     2      0   1    2.0
6: 52382MCK 11708    10      1   2    2.5
7: 52382MCK 11708     3      0   3    3.0
8: 52382MCK 11710    30      0   1   30.0
9: 52382MCK 11710    10      0   1   10.0
10: 52382MCK 11710    20      0   1   20.0
11: 52382MCK 11710    15      1   2   10.0
12: 52382MCK 11710     2      0   3    2.0
13: 52382MCK 11710    10      0   3   10.0
14: 52382MCK 11710     3      0   3    3.0
15: 52382MCK 11710    30      0   3   30.0
16: 52382MCK 11710    10      0   3   10.0
17: 52382MCK 11710    20      0   3   20.0
18: 52382MCK 11710    15      1   4   10.0
19: 52382MCK 11710     2      0   5    2.0
20: 52382MCK 11710    10      0   5   10.0
21: 52382MCK 11710     3      0   5    3.0
22: 52382MCK 11710    30      0   5   30.0
23: 52382MCK 11710    10      0   5   10.0
24: 52382MCK 11710    20      0   5   20.0
25: 52382MCK 11710    15      1   6   10.0
26: 52382MCK 11710     2      0   7    2.0
27: 52382MCK 11710    10      0   7   10.0
28: 52382MCK 11710     3      0   7    3.0
29: 52499MCK 11202     2      0   1    2.0
30: 52499MCK 11203     2      0   1    2.0
31: 52499MCK 11203     2      1   2    2.0
32: 52499MCK 11204     2      0   1    2.0
33: 52499MCK 11204     2      0   1    2.0
34: 52499MCK 11205     2      1   1    2.0
35: 52499MCK 11205     2      1   1    2.0
code  item sales action cat output

The following rows have been updated:

mydat[output != sales]
code  item sales action cat output
1: 52382MCK 11708    10      1   2    2.5
2: 52382MCK 11710    15      1   2   10.0
3: 52382MCK 11710    15      1   4   10.0
4: 52382MCK 11710    15      1   6   10.0
