I have a data frame with sales by PPG.
The data is:
df= structure(list(Ppg = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("p1",
"p2"), class = "factor"), product = structure(c(1L, 2L, 3L, 4L,
1L, 2L, 3L), .Label = c("A", "B", "C", "D"), class = "factor"),
sales = c(50, 40, 30, 80, 100, 70, 30)), .Names = c("Ppg",
"product", "sales"), row.names = c(NA, -7L), class = "data.frame")
> df
Ppg product sales
1 p1 A 50
2 p1 B 40
3 p1 C 30
4 p1 D 80
5 p2 A 100
6 p2 B 70
7 p2 C 30
I used dplyr to get the cumulative sum within each PPG:
df %>% group_by(Ppg) %>% mutate(c1 = cumsum(sales))
Ppg product sales c1
<fctr> <fctr> <dbl> <dbl>
1 p1 A 50 50
2 p1 B 40 90
3 p1 C 30 120
4 p1 D 80 200
5 p2 A 100 100
6 p2 B 70 170
7 p2 C 30 200
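Is there a way to:
i) calculate the proportion of sales (based on the cumulative sum), and
ii) find how many distinct products contribute to a given percentage of sales?
For example, for PPG p1, 2 distinct products (A & B combined) provide 75% of the sales.
So ideally the final result would look like this: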
ppg Number_Products_towards_75%
p1 2
p2 1
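This assumes you want to use the products in the order they currently appear, since reordering the rows would give you a different result.
For 1, you can get the result with an additional mutate. Just divide the cumulative sum by the total sales of the group: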
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales))
Which gives you:
# A tibble: 7 x 5
# Groups: Ppg [2]
Ppg product sales c1 percent
<fctr> <fctr> <dbl> <dbl> <dbl>
1 p1 A 50.0 50.0 0.250
2 p1 B 40.0 90.0 0.450
3 p1 C 30.0 120 0.600
4 p1 D 80.0 200 1.00
5 p2 A 100 100 0.500
6 p2 B 70.0 170 0.850
7 p2 C 30.0 200 1.00
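For 2, you can use mutate to add a column flagging whether each product's cumulative share is still below the threshold, then summarize to count the products below the threshold (and add one to that count, since one more product is what takes you over it).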
threshold <- 0.5
df %>%
group_by(Ppg) %>%
mutate(c1 = cumsum(sales)) %>%
mutate(percent = c1 / sum(sales)) %>%
mutate(isbelowthreshold = percent < threshold) %>% # add a column for if it's below the threshold
summarize(count = sum(isbelowthreshold) + 1) # we need to add one since one extra product will put you over the threshold
Which gives you:
# A tibble: 2 x 2
Ppg count
<fctr> <dbl>
1 p1 3.00
2 p2 1.00
But again, this depends on the order of the products. Consider ordering them from highest sales first? Something like:
df %>%
group_by(Ppg) %>%
arrange(Ppg, desc(sales))
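Putting the pieces together, here is a minimal sketch that sorts each group by descending sales before counting. It assumes dplyr is installed, rebuilds the sample data with plain character columns instead of factors, and uses the 75% threshold from the question; `result` is just an illustrative name.

```r
library(dplyr)

# Sample data from the question (character columns instead of factors)
df <- data.frame(
  Ppg = c("p1", "p1", "p1", "p1", "p2", "p2", "p2"),
  product = c("A", "B", "C", "D", "A", "B", "C"),
  sales = c(50, 40, 30, 80, 100, 70, 30)
)

threshold <- 0.75

result <- df %>%
  arrange(Ppg, desc(sales)) %>%                      # biggest sellers first within each PPG
  group_by(Ppg) %>%
  mutate(percent = cumsum(sales) / sum(sales)) %>%   # cumulative share of group sales
  summarize(count = sum(percent < threshold) + 1)    # products below the threshold, plus the one that crosses it

print(result)
```

With the sample data and the rows sorted this way, the count comes out as 3 for p1 and 2 for p2. Sorting first means the count is the smallest number of products needed to reach the threshold, rather than a number that depends on whatever order the rows happen to be in.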