R - Weighted mean by group with long data



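My question is somewhat similar to this question and builds on this answer. The only difference is that my data are in long format rather than wide, and I would like to keep them that way.

Is there a clever way to compute the weighted.mean() shown in that answer, but with long data?

Suppose my data look like this: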

library(tidyverse)

dft_w <- structure(list(obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), education = c("A", 
"A", "B", "B", "B", "B", "A", "A"), Item = c("income", "weight", 
"income", "weight", "income", "weight", "income", "weight"), 
Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame")); dft_w
# A tibble: 8 x 4
    obs education Item   Amount
  <int> <chr>     <chr>   <int>
1     1 A         income   1000
2     1 A         weight     10
3     2 B         income   2000
4     2 B         weight      1
5     3 B         income   1500
6     3 B         weight      5
7     4 A         income   2000
8     4 A         weight      2

I would like to get something like this:

# A tibble: 12 x 4
     obs education Item            Amount
   <int> <chr>     <chr>            <dbl>
 1     1 A         income           1000
 2     1 A         weight             10
 3     1 A         weighted_income  1167.
 4     2 B         income           2000
 5     2 B         weight              1
 6     2 B         weighted_income  1583.
 7     3 B         income           1500
 8     3 B         weight              5
 9     3 B         weighted_income  1583.
10     4 A         income           2000
11     4 A         weight              2
12     4 A         weighted_income  1167.

dft_w %>%
  group_by(education) %>%
  summarize(
    Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))), 
    obs = unique(obs), 
    Item = "weighted_income"
  ) %>%
  bind_rows(dft_w, .) %>%
  arrange(obs, education, Item)
# # A tibble: 12 x 4
#      obs education Item            Amount
#    <int> <chr>     <chr>            <dbl>
#  1     1 A         income           1000 
#  2     1 A         weight             10 
#  3     1 A         weighted_income  1167.
#  4     2 B         income           2000 
#  5     2 B         weight              1 
#  6     2 B         weighted_income  1583.
#  7     3 B         income           1500 
#  8     3 B         weight              5 
#  9     3 B         weighted_income  1583.
# 10     4 A         income           2000 
# 11     4 A         weight              2 
# 12     4 A         weighted_income  1167.
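
As a quick sanity check on the values 1167. and 1583. above, the weighted mean can be recomputed by hand as sum(x * w) / sum(w). The sketch below uses only base R; the variable names (income_A, weight_A, and so on) are made up for illustration, and the numbers are copied from dft_w.

```r
# Incomes and weights per education group, copied from dft_w
income_A <- c(1000, 2000); weight_A <- c(10, 2)
income_B <- c(2000, 1500); weight_B <- c(1, 5)

# Weighted mean by hand: sum(x * w) / sum(w)
manual_A <- sum(income_A * weight_A) / sum(weight_A)  # 14000 / 12
manual_B <- sum(income_B * weight_B) / sum(weight_B)  # 9500 / 6

# Same quantities from stats::weighted.mean()
wm_A <- weighted.mean(income_A, weight_A)
wm_B <- weighted.mean(income_B, weight_B)

round(c(A = wm_A, B = wm_B), 2)
# A = 1166.67, B = 1583.33
```

Both routes agree, which matches the rounded tibble output.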

Note that this fails if the data do not contain equal numbers of "income" and "weight" rows (error: 'x' and 'w' must have the same length).
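
To see where that message comes from, here is a minimal base R sketch of the failing call in isolation. It assumes the situation of group A after its first row is dropped: one income value but two weights. The names income, weight and msg are illustrative only.

```r
# Group A with one "income" row missing: one income value, two weights
income <- c(2000)
weight <- c(10, 2)

# weighted.mean() stops when x and w differ in length; capture the message
msg <- tryCatch(
  weighted.mean(income, weight),
  error = function(e) conditionMessage(e)
)
msg
```

Because the grouped pipeline calls weighted.mean() once per education group, a single unmatched row is enough to abort the whole summary.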

This can be preempted with adequate filtering, for example:

dft_w %>%
  slice(-1) %>%                         # just to trigger the fail, test the filter
  group_by(obs, education) %>%
  filter(all(c("income", "weight") %in% Item)) %>%
  group_by(education) %>%
  summarize(
    Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))), 
    obs = unique(obs), 
    Item = "weighted_income"
  ) %>%
  bind_rows(slice(dft_w, -1), .) %>%    # slice() only to keep the output consistent
  arrange(obs, education, Item)
# # A tibble: 10 x 4
#      obs education Item            Amount
#    <int> <chr>     <chr>            <dbl>
#  1     1 A         weight             10 
#  2     2 B         income           2000 
#  3     2 B         weight              1 
#  4     2 B         weighted_income  1583.
#  5     3 B         income           1500 
#  6     3 B         weight              5 
#  7     3 B         weighted_income  1583.
#  8     4 A         income           2000 
#  9     4 A         weight              2 
# 10     4 A         weighted_income  2000 

Note that obs/education pairs that do not have both items do not receive a "weighted_income" value.
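
The same filtered result can also be sketched with one summary row per group that is then joined back onto the complete obs/education pairs. This is a variant, not the code above: it avoids returning several rows per group from summarize(), which dplyr 1.1.0 and later warn about (suggesting reframe() instead). The intermediate names (dft_cut, complete_pairs, group_means, weighted_rows, result) are hypothetical, and sorting by obs and Item is an added assumption to make sure incomes and weights line up.

```r
library(dplyr)

# Example data from the question, minus its first row (obs 1 loses "income")
dft_w <- tibble(
  obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  education = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Item = rep(c("income", "weight"), times = 4),
  Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)
)
dft_cut <- slice(dft_w, -1)

# Keep only obs/education pairs that have both an income and a weight row
complete_pairs <- dft_cut %>%
  group_by(obs, education) %>%
  filter(all(c("income", "weight") %in% Item)) %>%
  ungroup() %>%
  arrange(obs, Item)  # incomes and weights then appear in the same obs order

# One weighted mean per education group (exactly one row per group)
group_means <- complete_pairs %>%
  group_by(education) %>%
  summarize(
    Amount = weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]),
    .groups = "drop"
  )

# Attach the group mean to each complete pair as a new long-format row
weighted_rows <- complete_pairs %>%
  distinct(obs, education) %>%
  left_join(group_means, by = "education") %>%
  mutate(Item = "weighted_income")

result <- bind_rows(dft_cut, weighted_rows) %>%
  arrange(obs, education, Item)
result
```

As in the output above, obs 1 keeps its lone "weight" row and gets no "weighted_income" row, while group A's mean is computed from obs 4 only.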

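Another approach is to use tidyr's pivot_wider and pivot_longer in the same pipe chain, so that you actually work with wide data before returning to long format. It may not be the most efficient approach, but it lets you keep using the usual "wide format" tips and tricks.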

library(dplyr)
dft_w %>% 
  tidyr::pivot_wider(names_from = Item, values_from = Amount) %>% 
  group_by(education) %>% 
  mutate(weighted_income = weighted.mean(income, weight)) %>% 
  tidyr::pivot_longer(3:last_col(), names_to = "Item", values_to = "Amount")

Output:

# A tibble: 12 x 4
# Groups:   education [2]
     obs education Item            Amount
   <int> <chr>     <chr>            <dbl>
 1     1 A         income           1000
 2     1 A         weight             10
 3     1 A         weighted_income  1167.
 4     2 B         income           2000
 5     2 B         weight              1
 6     2 B         weighted_income  1583.
 7     3 B         income           1500
 8     3 B         weight              5
 9     3 B         weighted_income  1583.
10     4 A         income           2000
11     4 A         weight              2
12     4 A         weighted_income  1167.
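
The output above is still grouped by education, and the pivot back relies on column positions (3:last_col()). A slightly more defensive sketch of the same pivot round trip, assuming the Item values are exactly "income" and "weight", names the columns explicitly and drops the grouping before pivoting back. The name result is illustrative only.

```r
library(dplyr)
library(tidyr)

# Example data from the question
dft_w <- tibble(
  obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  education = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Item = rep(c("income", "weight"), times = 4),
  Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)
)

result <- dft_w %>%
  pivot_wider(names_from = Item, values_from = Amount) %>%
  group_by(education) %>%
  mutate(weighted_income = weighted.mean(income, weight)) %>%
  ungroup() %>%                                  # drop the education grouping
  pivot_longer(
    cols = c(income, weight, weighted_income),   # explicit names, not positions
    names_to = "Item",
    values_to = "Amount"
  )
result
```

Naming the columns keeps the pivot correct if further identifier columns are added to the data later, and ungroup() returns a plain tibble for downstream steps.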

Here is another approach using tibble::add_row. I simply opted for a single summary row per grouping variable:

library(dplyr)
library(purrr)
dft_w %>%
  group_split(education) %>%
  map_dfr(~ .x %>% 
            add_row(obs = .x$obs[1], education = .x$education[1],
                    Item = "weighted.mean", 
                    Amount = weighted.mean(.x$Amount[.x$Item == "income"], 
                                           .x$Amount[.x$Item == "weight"])))
# A tibble: 10 x 4
     obs education Item          Amount
   <int> <chr>     <chr>          <dbl>
 1     1 A         income         1000
 2     1 A         weight           10
 3     4 A         income         2000
 4     4 A         weight            2
 5     1 A         weighted.mean  1167.
 6     2 B         income         2000
 7     2 B         weight            1
 8     3 B         income         1500
 9     3 B         weight            5
10     2 B         weighted.mean  1583.
