How do I distribute data for each row in R?



I have a dataset like the following:

df <- tribble(
~id,  ~price, ~type, ~number_of_book,        
"1",    10,     "X",        3,    
"1",     2,     "X",        1, 
"1",     5,     "Y",        1,         
"2",     7,     "X",        4,
"2",     6,     "X",        1,
"2",     6,     "Y",        2, 
"3",     2,     "X",        4,
"3",     8,     "X",        2,
"3",     1,     "Y",        4,
"3",     9,     "Y",        5,
)

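Now I want to answer this question: for each id and each chosen price group, what percentage of the books are type X and what percentage are type Y? In other words, what is the distribution of book types for each id and price group?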

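To do this, I first need to picture the following intermediate dataset, which holds the number of books per type, id and price group: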

agg_df <- tribble(
~type,     ~id,       ~less_than_two,  ~`three-five`,  ~`five-six`, ~more_than_six,
"X",      "1",              1,               0,           0,            3,
"Y",      "1",              0,               1,           0,            0,
"X",      "2",              0,               0,           1,            4,
"Y",      "2",              0,               0,           2,            2,
"X",      "3",              4,               0,           0,            2,
"Y",      "3",              4,               0,           0,            5,
)

Then this would be my desired dataset:

desired_df <- tribble(
~type,     ~id,       ~less_than_two,  ~`three-five`,  ~`five-six`, ~more_than_six,
"X",      "1",            "100%",           "0%",          "0%",       "100%",
"Y",      "1",              "0%",         "100%",          "0%",         "0%",
"X",      "2",              "0%",           "0%",       "33.3%",      "66.6%",
"Y",      "2",              "0%",           "0%",       "66.6%",       "33.3%",
"X",      "3",             "50%",           "0%",          "0%",      "28.5%",
"Y",      "3",             "50%",           "0%",          "0%",       "71.4%",
)

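This desired dataset tells me that when id is "3" and the price bin is more than six dollars, there are two books of type X but five books of type Y. So the distribution is X (28.5%) and Y (71.4%).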
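As a quick check of that arithmetic, here is a minimal base R sketch. The vector name books is only illustrative. Note that rounding to one decimal gives 28.6 for X, whereas the 28.5% quoted above is the truncated value.

```r
# Books priced above six dollars for id "3", by type (taken from df above)
books <- c(X = 2, Y = 5)

# Share of each type within this id and price group, in percent
share <- round(100 * books / sum(books), 1)
print(share)  # X = 28.6, Y = 71.4
```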

Note: I asked a similar question here, but this is a more complex operation and I could not work it out: How to manipulate (aggregate) data in R?

I would be grateful for any help. Thanks in advance.

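We can use findInterval to split price into groups, calculate the sum of number_of_book for each id, type and price_group, and then calculate the ratio within each id and price_group. Finally, we reshape the result to wide format with pivot_wider.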

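library(dplyr)

df %>%
  # Label each price with its bin: (-Inf, 2], (2, 5], (5, 6], (6, Inf)
  mutate(price_group = c("less_than_two", "three_five", "five_six", "more_than_six")[
    findInterval(price, c(2, 5, 6), left.open = TRUE) + 1]) %>%
  # Total number of books per id, type and price group
  group_by(id, type, price_group) %>%
  summarise(number_of_book = sum(number_of_book)) %>%
  # Share of each type within every id and price group, in percent
  group_by(id, price_group) %>%
  mutate(n = number_of_book / sum(number_of_book) * 100) %>%
  select(-number_of_book) %>%
  # One column per price group; missing combinations are filled with 0
  tidyr::pivot_wider(names_from = price_group, values_from = n,
                     values_fill = list(n = 0))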
#  id    type  less_than_two more_than_six three_five five_six
#  <chr> <chr>         <dbl>         <dbl>      <dbl>    <dbl>
#1 1     X               100         100            0      0  
#2 1     Y                 0           0          100      0  
#3 2     X                 0         100            0     33.3
#4 2     Y                 0           0            0     66.7
#5 3     X                50          28.6          0      0  
#6 3     Y                50          71.4          0      0  
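To see how the binning step assigns prices to groups, here is a small base R sketch. The prices vector is made up for illustration; the break points and labels are the ones used in the answer.

```r
# Upper bounds of the first three bins; anything above 6 lands in the last bin
breaks <- c(2, 5, 6)
labels <- c("less_than_two", "three_five", "five_six", "more_than_six")

# Illustrative prices, including values that sit exactly on a boundary
prices <- c(1, 2, 5, 6, 10)

# left.open = TRUE closes each interval on the right:
# (-Inf, 2], (2, 5], (5, 6], (6, Inf)
idx <- findInterval(prices, breaks, left.open = TRUE) + 1
bins <- labels[idx]

print(data.frame(price = prices, price_group = bins))
```

Without left.open = TRUE the intervals would be closed on the left instead, so a price of exactly 2 would move from less_than_two to three_five.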

We can create the bins with cut on the 'price' column, group by 'id' and 'grp', create the percentage by dividing 'number_of_book' by the sum of 'number_of_book', and reshape to 'wide' format.

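library(dplyr)
library(tidyr)

df %>%
  # Bin the prices with cut(); intervals are closed on the right by default
  group_by(id, grp = cut(price, breaks = c(-Inf, 2, 5, 6, Inf),
                         labels = c('less_than_two', 'three-five', 'five-six', 'more_than_six'))) %>%
  # Percentage of books within each id and price group
  mutate(Perc = 100 * number_of_book / sum(number_of_book)) %>%
  select(-price, -number_of_book) %>%
  # Temporary row index within each id/grp group so every cell is uniquely identified when pivoting
  mutate(rn = row_number()) %>%
  pivot_wider(names_from = grp, values_from = Perc, values_fill = list(Perc = 0)) %>%
  select(-rn)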
# A tibble: 6 x 6
# Groups:   id [3]
#  id    type  more_than_six less_than_two `three-five` `five-six`
#  <chr> <chr>         <dbl>         <dbl>        <dbl>      <dbl>
#1 1     X             100             100            0        0  
#2 1     Y               0               0          100        0  
#3 2     X             100               0            0       33.3
#4 2     Y               0               0            0       66.7
#5 3     X              28.6            50            0        0  
#6 3     Y              71.4            50            0        0  
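The cut call above should produce the same bins as findInterval with left.open = TRUE, because cut closes intervals on the right by default (right = TRUE). A minimal base R sketch of that equivalence, using made-up prices:

```r
# Illustrative prices, including boundary values
prices <- c(1, 2, 5, 6, 10)
labels <- c("less_than_two", "three-five", "five-six", "more_than_six")

# cut() with right-closed intervals: (-Inf, 2], (2, 5], (5, 6], (6, Inf]
grp <- cut(prices, breaks = c(-Inf, 2, 5, 6, Inf), labels = labels)

# The same bin index computed with findInterval()
by_interval <- findInterval(prices, c(2, 5, 6), left.open = TRUE) + 1

# Both approaches assign every price to the same bin
same_bins <- identical(as.integer(grp), as.integer(by_interval))
print(data.frame(price = prices, grp = grp, same = same_bins))
```

One practical difference is that cut returns a factor, so the price groups keep their natural order, while indexing a character vector of labels returns plain strings.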

Perhaps not the perfect solution, but another way is to use case_when to define the different categories:

library(tidyverse)

df %>%
  group_by(id, type, price) %>%
  # One column per price bin; rows outside a bin get NA
  mutate(Less2 = case_when(price <= 2 ~ cumsum(number_of_book)),
         Three_Five = case_when(price > 2 & price <= 5 ~ cumsum(number_of_book)),
         Five_six = case_when(price > 5 & price <= 6 ~ cumsum(number_of_book)),
         More_six = case_when(price > 6 ~ cumsum(number_of_book))) %>%
  replace(is.na(.), 0) %>%
  ungroup(.) %>%
  # Total number of books per id and type in every bin
  group_by(id, type) %>%
  summarise_at(vars(Less2:More_six), ~sum(.)) %>%
  ungroup(.) %>%
  # Percentage of each type within an id; empty bins (0/0) become 0
  group_by(id) %>%
  mutate_at(vars(Less2:More_six), ~ replace_na(100 * . / sum(.), 0))
# A tibble: 6 x 6
# Groups:   id [3]
#  id    type  Less2 Three_Five Five_six More_six
#  <chr> <chr> <dbl>      <dbl>    <dbl>    <dbl>
#1 1     X       100          0      0      100
#2 1     Y         0        100      0        0
#3 2     X         0          0     33.3    100
#4 2     Y         0          0     66.7      0
#5 3     X        50          0      0       28.6
#6 3     Y        50          0      0       71.4
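All of the answers above return numeric percentages, while the desired dataset in the question holds strings such as "33.3%". A minimal sketch of one way to convert them in base R follows; the helper name format_pct is hypothetical and not part of any package.

```r
# Hypothetical helper: format numeric percentages as strings,
# with no decimals for whole numbers and one decimal otherwise
format_pct <- function(x) {
  ifelse(x == round(x), sprintf("%.0f%%", x), sprintf("%.1f%%", x))
}

# Percentages like the ones produced by the answers above
pct <- 100 * c(1, 0, 1/3, 2/3, 2/7, 5/7)
formatted <- format_pct(pct)
print(formatted)
```

With dplyr loaded, something like mutate(across(where(is.numeric), format_pct)) could apply the helper to every numeric column of the wide result. Because sprintf rounds rather than truncates, the values come out as 66.7% and 28.6%, not the 66.6% and 28.5% written in the question.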