I have a dataset that looks like this:
df <- tribble(
~id, ~price, ~type, ~number_of_book,
"1", 10, "X", 3,
"1", 2, "X", 1,
"1", 5, "Y", 1,
"2", 7, "X", 4,
"2", 6, "X", 1,
"2", 6, "Y", 2,
"3", 2, "X", 4,
"3", 8, "X", 2,
"3", 1, "Y", 4,
"3", 9, "Y", 5,
)
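Now I want to answer this question: for each id and each chosen price group, what percentage of the books is type X and what percentage is type Y? In other words, what is the distribution of book types for each id and price group?
To do that, I first need to picture the data aggregated like this: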
agg_df <- tribble(
~type, ~id, ~less_than_two, ~three_five, ~five_six, ~more_than_six,
"X", "1", 1, 0, 0, 3,
"Y", "1", 0, 1, 0, 0,
"X", "2", 0, 0, 1, 4,
"Y", "2", 0, 0, 2, 0,
"X", "3", 4, 0, 0, 2,
"Y", "3", 4, 0, 0, 5,
)
Then, this would be my desired dataset:
desired_df <- tribble(
~type, ~id, ~less_than_two, ~three_five, ~five_six, ~more_than_six,
"X", "1", "100%", "0%", "0%", "100%",
"Y", "1", "0%", "100%", "0%", "0%",
"X", "2", "0%", "0%", "33.3%", "100%",
"Y", "2", "0%", "0%", "66.6%", "0%",
"X", "3", "50%", "0%", "0%", "28.5%",
"Y", "3", "50%", "0%", "0%", "71.4%",
)
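This desired dataset tells me that when id is "3" and the price bin is more than six dollars, there are two books of type X and five books of type Y. So the distribution is: X (28.5%) and Y (71.4%).
Note: I asked a similar question here, but this is a more complex operation and I could not manage it: How to manipulate (aggregate) the data in R?
I would appreciate any help. Thanks in advance.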
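As a quick check of that arithmetic, here is a minimal sketch in base R. The vector `books` is just a hand-typed copy of the two counts mentioned above (an illustrative name, not part of the original data frame). Note that rounding to one decimal gives 28.6 for X, whereas the 28.5% quoted above is truncated.

```r
# Books in the "more than six dollars" bin for id "3" (counts copied by hand)
books <- c(X = 2, Y = 5)

# Share of each type within this id and price bin, in percent
share <- round(100 * books / sum(books), 1)
share
```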
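We can use findInterval to split price into groups, compute the sum of number_of_book for each id, type, and price_group, and then calculate the ratio within each id and price_group. Finally, we reshape to wide format with pivot_wider.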
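library(dplyr)
df %>%
  # Bin price into (-Inf, 2], (2, 5], (5, 6], (6, Inf) and look up the bin label
  mutate(price_group = c("less_than_two", "three_five", "five_six", "more_than_six")
         [findInterval(price, c(2, 5, 6), left.open = TRUE) + 1]) %>%
  # Total books per id, type, and price group
  group_by(id, type, price_group) %>%
  summarise(number_of_book = sum(number_of_book), .groups = "drop") %>%
  # Percentage of each type within an id and price group
  group_by(id, price_group) %>%
  mutate(n = number_of_book / sum(number_of_book) * 100) %>%
  select(-number_of_book) %>%
  # One column per price group; missing combinations become 0
  tidyr::pivot_wider(names_from = price_group, values_from = n,
                     values_fill = list(n = 0))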
# id type less_than_two more_than_six three_five five_six
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 X 100 100 0 0
#2 1 Y 0 0 100 0
#3 2 X 0 100 0 33.3
#4 2 Y 0 0 0 66.7
#5 3 X 50 28.6 0 0
#6 3 Y 50 71.4 0 0
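To see how findInterval treats prices that sit exactly on a break, here is a small sketch using base R only. The `prices` vector is made up for illustration; the breaks and labels are the ones used in the answer above.

```r
breaks <- c(2, 5, 6)
bin_labels <- c("less_than_two", "three_five", "five_six", "more_than_six")

# Illustrative prices, including values that fall exactly on a break
prices <- c(1, 2, 5, 6, 7)

# left.open = TRUE makes the intervals (-Inf, 2], (2, 5], (5, 6], (6, Inf),
# so a price equal to a break stays in the lower bin
idx <- findInterval(prices, breaks, left.open = TRUE) + 1
setNames(bin_labels[idx], prices)
```

With the default left.open = FALSE, a price of exactly 2 would land in the second bin instead, which is why the argument matters here.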
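We can create a binning group on the "price" column with cut, group by "id" and "grp", create the percentage by dividing "number_of_book" by the sum of "number_of_book", and reshape to "wide" format.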
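library(dplyr)
library(tidyr)
df %>%
  # cut() uses right-closed bins by default: (-Inf, 2], (2, 5], (5, 6], (6, Inf]
  # (the deprecated `add = TRUE` is dropped; df is not grouped at this point)
  group_by(id, grp = cut(price, breaks = c(-Inf, 2, 5, 6, Inf),
                         labels = c('less_than_two', 'three-five', 'five-six', 'more_than_six'))) %>%
  mutate(Perc = 100 * number_of_book / sum(number_of_book)) %>%
  select(-price, -number_of_book) %>%
  # Row index within each group keeps rows uniquely identified for pivot_wider
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = grp, values_from = Perc, values_fill = list(Perc = 0)) %>%
  select(-rn)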
# A tibble: 6 x 6
# Groups: id [3]
# id type more_than_six less_than_two `three-five` `five-six`
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 X 100 100 0 0
#2 1 Y 0 0 100 0
#3 2 X 100 0 0 33.3
#4 2 Y 0 0 0 66.7
#5 3 X 28.6 50 0 0
#6 3 Y 71.4 50 0 0
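The rn helper above assumes each type appears at most once per id and bin. As a hedged variant (not the answerer's code), the sketch below sums the books per id, type, and bin first, so repeated rows in the same bin would be combined before the percentage is taken. The names `wide`, `books`, and `perc` and the underscore labels are my own choices; `values_fill = 0` and `names_sort = TRUE` assume tidyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
df <- tribble(
  ~id, ~price, ~type, ~number_of_book,
  "1", 10, "X", 3,
  "1", 2, "X", 1,
  "1", 5, "Y", 1,
  "2", 7, "X", 4,
  "2", 6, "X", 1,
  "2", 6, "Y", 2,
  "3", 2, "X", 4,
  "3", 8, "X", 2,
  "3", 1, "Y", 4,
  "3", 9, "Y", 5
)

wide <- df %>%
  # Right-closed bins: (-Inf, 2], (2, 5], (5, 6], (6, Inf]
  mutate(grp = cut(price, breaks = c(-Inf, 2, 5, 6, Inf),
                   labels = c("less_than_two", "three_five", "five_six", "more_than_six"))) %>%
  # Total books per id, type, and bin
  count(id, type, grp, wt = number_of_book, name = "books") %>%
  # Percentage of each type within an id and bin
  group_by(id, grp) %>%
  mutate(perc = 100 * books / sum(books)) %>%
  ungroup() %>%
  select(-books) %>%
  # names_sort = TRUE orders the new columns by the factor levels of grp
  pivot_wider(names_from = grp, values_from = perc, values_fill = 0,
              names_sort = TRUE)

wide
```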
Maybe not the perfect solution, but another approach is to use case_when to define the different categories:
library(tidyverse)
df %>%
  # One column per price bin; the bins do not overlap, so a price of 5
  # counts only towards Three_Five
  mutate(Less2 = case_when(price <= 2 ~ number_of_book, TRUE ~ 0),
         Three_Five = case_when(price > 2 & price <= 5 ~ number_of_book, TRUE ~ 0),
         Five_six = case_when(price > 5 & price <= 6 ~ number_of_book, TRUE ~ 0),
         More_six = case_when(price > 6 ~ number_of_book, TRUE ~ 0)) %>%
  # Total books per id and type in each bin
  group_by(id, type) %>%
  summarise(across(Less2:More_six, sum), .groups = "drop") %>%
  # Percentage within each id; 0/0 (an empty bin) is replaced by 0
  group_by(id) %>%
  mutate(across(Less2:More_six, ~ replace_na(100 * .x / sum(.x), 0)))
# A tibble: 6 x 6
# Groups: id [3]
# id type Less2 Three_Five Five_six More_six
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#1 1 X 100 0 0 100
#2 1 Y 0 100 0 0
#3 2 X 0 0 33.3 100
#4 2 Y 0 0 66.7 0
#5 3 X 50 0 0 28.6
#6 3 Y 50 0 0 71.4
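The answers above return numeric percentages, while the desired dataset in the question shows strings such as "33.3%". One possible way to bridge that, sketched in base R only: the `perc` vector below is a hand-typed stand-in for one of the numeric columns, and `pct_text` is an illustrative name.

```r
# Stand-in for a numeric percentage column from any of the answers
perc <- c(100, 0, 100 / 3, 200 / 3, 200 / 7, 500 / 7)

# Round to one decimal and append a percent sign
pct_text <- paste0(round(perc, 1), "%")
pct_text
```

This rounds rather than truncates, so 2/7 shows up as "28.6%" and 2/3 as "66.7%".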