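My question is similar to this question and builds on this answer. The only difference is that my data is in long format rather than wide, and I would like to keep it that way.
I am wondering whether there is a clever way to compute the weighted.mean() shown in that answer, but with long data.
Suppose my data looks like this:
library(tidyverse)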
dft_w <- structure(list(obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), education = c("A",
"A", "B", "B", "B", "B", "A", "A"), Item = c("income", "weight",
"income", "weight", "income", "weight", "income", "weight"),
Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame")); dft_w
# A tibble: 8 x 4
obs education Item Amount
<int> <chr> <chr> <int>
1 1 A income 1000
2 1 A weight 10
3 2 B income 2000
4 2 B weight 1
5 3 B income 1500
6 3 B weight 5
7 4 A income 2000
8 4 A weight 2
I would like to get something like this:
# A tibble: 12 x 4
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 1 A weighted_income 1167.
4 2 B income 2000
5 2 B weight 1
6 2 B weighted_income 1583.
7 3 B income 1500
8 3 B weight 5
9 3 B weighted_income 1583.
10 4 A income 2000
11 4 A weight 2
12 4 A weighted_income 1167.
dft_w %>%
group_by(education) %>%
summarize(
Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))),
obs = unique(obs),
Item = "weighted_income"
) %>%
bind_rows(dft_w, .) %>%
arrange(obs, education, Item)
# # A tibble: 12 x 4
# obs education Item Amount
# <int> <chr> <chr> <dbl>
# 1 1 A income 1000
# 2 1 A weight 10
# 3 1 A weighted_income 1167.
# 4 2 B income 2000
# 5 2 B weight 1
# 6 2 B weighted_income 1583.
# 7 3 B income 1500
# 8 3 B weight 5
# 9 3 B weighted_income 1583.
# 10 4 A income 2000
# 11 4 A weight 2
# 12 4 A weighted_income 1167.
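As a hedged aside: to my knowledge, returning more than one row per group from summarize() is deprecated as of dplyr 1.1.0 in favor of reframe(). The sketch below is the same idea written with reframe(), assuming dplyr >= 1.1.0 is installed. The names `weighted_rows` and `result` are illustrative, and the data is rebuilt with tibble() only to keep the block self-contained.

```r
library(dplyr)

# Same data as in the question, rebuilt so this block runs on its own
dft_w <- tibble(
  obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  education = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Item = rep(c("income", "weight"), times = 4),
  Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)
)

# One "weighted_income" row per obs, computed per education group.
# Amount is computed first, while Item still refers to the original column.
weighted_rows <- dft_w %>%
  group_by(education) %>%
  reframe(
    Amount = rep(
      weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]),
      n_distinct(obs)
    ),
    obs = unique(obs),
    Item = "weighted_income"
  )

# Append the new rows to the long data and restore the row order
result <- bind_rows(dft_w, weighted_rows) %>%
  arrange(obs, education, Item)
```

Unlike summarize(), reframe() always returns an ungrouped tibble, so no ungroup() is needed before bind_rows().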
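Note that weighted.mean() raises an error ('x' and 'w' must have the same length) if the data do not contain equal numbers of "income" and "weight" rows.
This can be preempted with adequate filtering, for example: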
dft_w %>%
slice(-1) %>% # just to trigger the fail, test the filter
group_by(obs, education) %>%
filter(all(c("income", "weight") %in% Item)) %>%
group_by(education) %>%
summarize(
Amount = rep(weighted.mean(Amount[Item == "income"], Amount[Item == "weight"]), length(unique(obs))),
obs = unique(obs),
Item = "weighted_income"
) %>%
bind_rows(slice(dft_w, -1), .) %>% # slice() only to keep the output consistent
arrange(obs, education, Item)
# # A tibble: 10 x 4
# obs education Item Amount
# <int> <chr> <chr> <dbl>
# 1 1 A weight 10
# 2 2 B income 2000
# 3 2 B weight 1
# 4 2 B weighted_income 1583.
# 5 3 B income 1500
# 6 3 B weight 5
# 7 3 B weighted_income 1583.
# 8 4 A income 2000
# 9 4 A weight 2
# 10 4 A weighted_income 2000
Note that any obs/education pair that does not have both items will not get a "weighted_income" value.
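Before filtering, it can help to see which pairs would be dropped. The following is a minimal sketch that inverts the filter condition used above. The names `dft_short` and `incomplete` are illustrative only, and the data is rebuilt so the block runs on its own.

```r
library(dplyr)

# Same data as in the question
dft_w <- tibble(
  obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  education = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Item = rep(c("income", "weight"), times = 4),
  Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)
)

# Drop the first row so that obs 1 has a weight but no income
dft_short <- slice(dft_w, -1)

# Keep only the rows of obs/education pairs that lack "income" or "weight"
incomplete <- dft_short %>%
  group_by(obs, education) %>%
  filter(!all(c("income", "weight") %in% Item)) %>%
  ungroup()

incomplete
```

With the first row removed, only the remaining "weight" row of obs 1 should be listed here.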
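Another approach is to use tidyr's pivot_wider and pivot_longer in the same pipe chain, so that you actually work on wide data before returning to long format. This may not be the most efficient approach, but it lets you keep using the "wide format" tips and tricks.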
library(dplyr)
dft_w %>%
tidyr::pivot_wider(names_from = Item, values_from = Amount) %>%
group_by(education) %>%
mutate(weighted_income = weighted.mean(income, weight)) %>%
tidyr::pivot_longer(3:last_col(), names_to = "Item", values_to = "Amount")
Output:
# A tibble: 12 x 4
# Groups: education [2]
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 1 A weighted_income 1167.
4 2 B income 2000
5 2 B weight 1
6 2 B weighted_income 1583.
7 3 B income 1500
8 3 B weight 5
9 3 B weighted_income 1583.
10 4 A income 2000
11 4 A weight 2
12 4 A weighted_income 1167.
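The output above is still grouped by education, and the columns to lengthen are selected by position (3:last_col()). A hedged variant, assuming the item columns are known by name, selects them explicitly and drops the grouping before pivoting back. `result` is an illustrative name, and the data is rebuilt so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
dft_w <- tibble(
  obs = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L),
  education = c("A", "A", "B", "B", "B", "B", "A", "A"),
  Item = rep(c("income", "weight"), times = 4),
  Amount = c(1000L, 10L, 2000L, 1L, 1500L, 5L, 2000L, 2L)
)

result <- dft_w %>%
  # long -> wide: one row per obs, with income and weight as columns
  pivot_wider(names_from = Item, values_from = Amount) %>%
  group_by(education) %>%
  mutate(weighted_income = weighted.mean(income, weight)) %>%
  # drop the grouping so the result is a plain tibble
  ungroup() %>%
  # wide -> long: name the columns instead of relying on their position
  pivot_longer(
    cols = c(income, weight, weighted_income),
    names_to = "Item",
    values_to = "Amount"
  )
```

Naming the columns keeps the pipeline working if further identifier columns are later added before income.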
Here is another approach using tibble::add_row. I just opted for a single summary row per grouping variable:
library(dplyr)
library(purrr)
dft_w %>%
group_split(education) %>%
map_dfr(~ .x %>%
add_row(obs = .x$obs[1], education = .x$education[1],
Item = "weighted.mean", Amount = weighted.mean(.x$Amount[.x$Item == "income"],
.x$Amount[.x$Item == "weight"])))
# A tibble: 10 x 4
obs education Item Amount
<int> <chr> <chr> <dbl>
1 1 A income 1000
2 1 A weight 10
3 4 A income 2000
4 4 A weight 2
5 1 A weighted.mean 1167.
6 2 B income 2000
7 2 B weight 1
8 3 B income 1500
9 3 B weight 5
10 2 B weighted.mean 1583.