在 R 中计算加权组均值的快速方法?

给定纵向数据，如何计算矩阵，其中每列表示给定变量的加权组平均值？

我开发了一种需要循环的方法，但它太慢了。我认为这可能是矢量化的，但解决方案对我来说是逃避的。

这是我目前的方法：

library(foreach)
# N is sample size
# g is the number of groups
# p is the number of variables
get_group_mean_matrix <- function(N, g, p){
X <- matrix(rbinom(N*p, 10, .5), N)
f <- sort((1:(N)) %% g + 1)
w <- runif(N)
dmmat <- foreach(i = unique(f), .combine = rbind) %do% {
idx <- which(f == i)
ws <- w[idx]/sum(w[idx])
t((t(X[idx,]) %*% ws)) %x% rep(1, length(idx))
}
dmmat
}
> set.seed(666)
> get_group_mean_matrix(12, 3, 5)
[,1]     [,2]     [,3]     [,4]     [,5]
[1,] 5.261103 4.074266 5.828070 4.452703 5.990165
[2,] 5.261103 4.074266 5.828070 4.452703 5.990165
[3,] 5.261103 4.074266 5.828070 4.452703 5.990165
[4,] 5.261103 4.074266 5.828070 4.452703 5.990165
[5,] 5.560556 4.241942 3.698828 5.572523 4.212532
[6,] 5.560556 4.241942 3.698828 5.572523 4.212532
[7,] 5.560556 4.241942 3.698828 5.572523 4.212532
[8,] 5.560556 4.241942 3.698828 5.572523 4.212532
[9,] 4.289029 4.771115 5.150607 4.424339 6.346775
[10,] 4.289029 4.771115 5.150607 4.424339 6.346775
[11,] 4.289029 4.771115 5.150607 4.424339 6.346775
[12,] 4.289029 4.771115 5.150607 4.424339 6.346775
> library(microbenchmark)
> microbenchmark(get_group_mean_matrix(1200, 300, 50))
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
get_group_mean_matrix(1200, 300, 50) 76.33337 77.39607 80.76586 78.39808 84.46984 93.40047   100

最初，我尝试使用lfe::demeanlist执行此操作，但它给了我错误的输出！

library(lfe)
get_group_mean_matrix_lfe <- function(N, g, p){
X <- matrix(rbinom(N*p, 10, .5), N)
f <- sort((1:(N)) %% g + 1)
w <- runif(N)
X - demeanlist(X, list(factor(f)), weights = w)
}
> set.seed(666)
> get_group_mean_matrix_lfe(12, 3, 5)
[,1]     [,2]     [,3]     [,4]     [,5]
[1,] 5.138068 4.001781 5.415467 4.722947 5.999827
[2,] 5.138068 4.001781 5.415467 4.722947 5.999827
[3,] 5.138068 4.001781 5.415467 4.722947 5.999827
[4,] 5.138068 4.001781 5.415467 4.722947 5.999827
[5,] 5.197308 4.067657 3.202478 5.866451 4.066385
[6,] 5.197308 4.067657 3.202478 5.866451 4.066385
[7,] 5.197308 4.067657 3.202478 5.866451 4.066385
[8,] 5.197308 4.067657 3.202478 5.866451 4.066385
[9,] 4.189951 4.887720 4.953305 4.501874 6.385846
[10,] 4.189951 4.887720 4.953305 4.501874 6.385846
[11,] 4.189951 4.887720 4.953305 4.501874 6.385846
[12,] 4.189951 4.887720 4.953305 4.501874 6.385846
> library(microbenchmark)
> microbenchmark(get_group_mean_matrix_lfe(1200, 300, 50))
Unit: milliseconds
expr      min       lq     mean   median       uq      max neval
get_group_mean_matrix_lfe(1200, 300, 50) 6.107421 6.202426 6.500411 6.293648 6.582943 8.350876   100

虽然它快了很多...

我会接受两种答案中的任何一种：

解释lfe::demeanlist在加权情况下正在做什么。当我从平均值中减去加权偏差时，我不应该得到加权平均值吗？知道了这一点，我如何计算加权组均值矩阵？
不涉及举止性计算加权组均值矩阵的方法。

注意：使用RcppEigen将%*%替换为矩阵乘法函数可以加快速度，但还不够。我认为问题在于循环。

下面是一些示例输入：

f X1 X2 X3 X4 X5
1  1  6  5  7  3  6
2  1  6  4  5  5  6
3  1  5  6  3  6  6
4  1  3  5  4  3  5
5  2  5  4  7  7  7
6  2  4  1  4  2  6
7  2  5  6  6  6  5
8  2  6  7  2  5  4
9  3  5  3  4  6  9
10 3  6  6  5  5  6
11 3  5  7  4  6  8
12 3  5  3  7  8  6

其中f是分组因素。

Hurr durr 我所要做的就是将进入demeanlisthurr durr 的权重平方根

library(foreach)
get_group_mean_matrix <- function(N, g, p){
X <- matrix(rbinom(N*p, 10, .5), N)
f <- sort((1:(N)) %% g + 1)
w <- runif(N)
dmmat <- foreach(i = unique(f), .combine = rbind) %do% {
idx <- which(f == i)
ws <- w[idx]/sum(w[idx])
t((t(X[idx,]) %*% ws)) %x% rep(1, length(idx))
}
dmmat
}
set.seed(666)
A <- get_group_mean_matrix(12, 3, 5)
library(lfe)
get_group_mean_matrix_lfe <- function(N, g, p){
X <- matrix(rbinom(N*p, 10, .5), N)
f <- sort((1:(N)) %% g + 1)
w <- runif(N)
X - demeanlist(X, list(factor(f)), weights = w^.5)
}
set.seed(666)
B <- get_group_mean_matrix_lfe(12, 3, 5)
> all.equal(A, B)
[1] TRUE

相关内容

最新更新

热门标签：