R中向量中类似元素的加权平均值



我有两个向量xww 是一个权重的数字向量,其长度与 x 相同,给出用于x元素的权重。

我想给出向量x中差异较小的元素的加权平均值(例如 1e-1 或 1e-2),以减少向量x的长度。 例如,这些向量如下所示:

    w =c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04,
         2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04,
         4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03,
         1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02,
         1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03,
         3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01,
         3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04,
         7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04,
         8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04,
         5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07,
         3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07,
         2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07,
         2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07,
         2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03)
     x =c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900,
          0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500,
          1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027,
          2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953,
          2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953,
          3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358,
          3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813,
          0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596,
          2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608,
          3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866)

我知道如何根据矢量 x 的权重对它们进行排序,但我如何识别矢量 x 中的相似值,然后获得它们的加权平均值?

更新的答案

像这样的事情怎么样...?(见下面的代码)

我称您的原始向量为 origx 和 origw,因此重新排序的向量是 x 和 w。该代码适用于被销毁的 x 和 w(称为 xtemp 和 wtemp)的临时副本,并在变量 xnew 和 wnew 中构建新的 x 和 w(即您寻找的"较短"向量)。

简单来说,代码查看 xtemp 并找到超过阈值大小(例如 0.05)的第一个间隙,并将从 xtemp 开始到该"大"间隙的所有元素组合在一起。(如果没有这样的差距,它将整个 xtemp 作为一个组。然后,代码将该组转换为称为 wgroup 的单个权重(组权重的总和)和称为 xgroup 的单个代表性 x 值(使得 xgroup*wgroup 与所有组元素的加权和相同)。然后,我们将 xgroup 和 wgroup 保存到向量 xnew 和 wnew 中,清除当前组(通过将其从 xtemp 和 wtemp 中消除),然后以相同的方式继续,直到所有内容都分组。

试运行一下,看看你怎么想:)

origw = c(1.459032e-01, 1.535375e-04, 1.829973e-04, 1.057226e-01, 2.833444e-04,
          2.559756e-04, 6.440060e-03, 6.294748e-02, 5.984383e-04, 2.772186e-04,
          4.869825e-05, 8.212092e-04, 1.233256e-01, 2.558964e-04, 3.990816e-03,
          1.665515e-01, 5.760450e-02, 5.803227e-04, 1.738252e-02, 2.431885e-02,
          1.280266e-03, 1.000000e-03, 1.000117e-03, 2.750921e-03, 3.588227e-03,
          3.489142e-04, 5.117452e-04, 5.117502e-04, 3.262697e-01, 3.060975e-01,
          3.089723e-02, 8.603438e-04, 8.603438e-04, 2.558906e-04, 2.558906e-04,
          7.559512e-04, 1.054060e-03, 8.318323e-04, 8.602753e-04, 8.603439e-04,
          8.269244e-04, 8.602833e-04, 8.979898e-04, 7.745014e-04, 5.117474e-04,
          5.691315e+00, 1.780994e+00, 2.416622e-03, 2.441406e-07, 2.441406e-07,
          3.065381e-05, 2.441406e-07, 2.441328e-07, 2.441324e-07, 2.884505e-07,
          2.441409e-07, 2.441411e-07, 2.441399e-07, 2.441406e-07, 2.441400e-07,
          2.441397e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07, 2.441406e-07,
          2.441406e-07, 2.441406e-07, 2.441404e-07, 2.441406e-07, 1.920616e-03)
origx = c(0.3585121, 0.4399527, 0.5643820, 0.6776966, 0.7542579, 0.8374223, 0.9130900,
          0.9999472, 1.0793771, 1.1249381, 1.1700218, 1.2630534, 1.4131273, 1.4795500,
          1.5388979, 1.6587155, 1.7106946, 1.8248076, 1.9035620, 1.9512584, 2.0362027,
          2.1065388, 2.1525816, 2.2617268, 2.6090246, 2.7180285, 2.7704006, 2.8768953,
          2.9358206, 3.0000000, 3.0655239, 3.1266109, 3.1730078, 3.2681434, 3.3125953,
          3.3620683, 3.4191661, 3.4851182, 3.5373484, 3.5998778, 3.6622245, 3.7306358,
          3.8066598, 3.8726307, 3.9614728, 4.0515907, 4.0998298, 4.1870790, 0.4429813,
          0.5619184, 0.6437753, 0.6856169, 1.1212656, 1.2513217, 1.7290070, 1.9762596,
          2.0103108, 2.0440587, 2.2404542, 2.2742832, 2.5947769, 3.1292874, 3.1730608,
          3.4075734, 3.4651103, 3.5266852, 3.5886457, 3.7197153, 3.7967120, 4.0553866)
reord = order(origx)
x = origx[reord]
w = origw[reord]
xnew = wnew = c()
thresh = 0.05
xtemp = x
wtemp = w
while (length(xtemp) > 0) {
nextgap = which(diff(xtemp) > thresh)[1]
if (!is.na(nextgap)) {
    group = seq_len(nextgap)
} else {
    group = seq_along(xtemp)
}
xgroup = sum((xtemp*wtemp)[group])/sum(wtemp[group])
wgroup = sum(wtemp[group])
xnew = c(xnew, xgroup)
wnew = c(wnew, wgroup)
xtemp = xtemp[-group]
wtemp = wtemp[-group]
}

旧响应如下(被上述内容取代...

我建议对 x 和 w 重新排序,使 x 严格按数字顺序排列,然后使用 diff 函数:

reord = order(x)
x2 = x[reord]
w2 = w[reord]
which(diff(x2)<0.01)

上面的最后一个命令指示x2中的哪些元素(x 的排序版本)在下一个最高元素的 0.01 范围内。第一个值是 2,因为 x2 的元素 2 和 3 就是这样一个例子:x2[2]=0.4399527x2[3]=0.4429813 .

另外,如果你这样做

sort(diff(x2))

您可以看到按数字顺序排列的所有差异,这可能有助于您确定合适的截止值应该是多少。

最新更新