pairwise.complete.obs in R's cov function



I have a simulated dataset (a toy version of my problem) that looks like this:

A = factor(rep("A",252));A
B = factor(rep("B",190));B
FACT = c(A,B)
x = rnorm(252)
y = rnorm(190)
d = c(x,y)
DATA = tibble(FACT,d);DATA

which results in:

# A tibble: 442 x 2
FACT       d
<fct>  <dbl>
1 A     -0.172
2 A      1.23 
3 A     -0.589
4 A      0.512
5 A     -1.00 
6 A      0.532
7 A      0.562
8 A     -0.403
9 A      2.10 
10 A      0.649
# ... with 432 more rows

Now I have a vector of interest with length 100.

z = rnorm(100)

I want to find the covariance of the vector z with each of the vectors x and y separately. To do this in R, I tried:

DATA %>%
group_by(FACT)%>%
dplyr::mutate(row = row_number())%>%
tidyr::pivot_wider(names_from = FACT, values_from = d)%>%
dplyr::select(-row)%>%
dplyr::mutate((across(.cols= everything(),~cov(.x,z,use= "pairwise.complete.obs"))))%>%
slice(n=1)%>%
tidyr::pivot_longer( cols = tidyselect::everything(), names_to = "FACT", values_to = "CoV")

But R reports an error, even though I am using "pairwise.complete.obs".

The error is:

Error in `dplyr::mutate()`:
! Problem while computing `..1 = (across(.cols =
everything(), ~cov(.x, z, use =
"pairwise.complete.obs")))`.
Caused by error in `across()`:
! Problem while computing column `A`.
Caused by error in `cov()`:
! incompatible dimensions

Imagine that my real-world problem has 150 factor categories. How can I fix this? Any help?

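The problem is that you are trying to get the covariance of vectors of different lengths. "pairwise.complete.obs" only shows up in the error message because the message prints the call that raised the error; it is not the cause. The important part is: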

Caused by error in `cov()`:
! incompatible dimensions

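That is, you are asking for the covariance of a length-252 vector with a length-100 vector. If all the vectors have the same length, there is no error: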

library(tidyverse)
A = factor(rep("A",100))
B = factor(rep("B",100))
FACT = c(A,B)
x = rnorm(100)
y = rnorm(100)
d = c(x,y)
DATA = tibble(FACT,d)
z = rnorm(100)
DATA %>%
group_by(FACT)%>%
dplyr::mutate(row = row_number())%>%
tidyr::pivot_wider(names_from = FACT, values_from = d) %>% 
dplyr::select(-row)%>%
dplyr::mutate((across(.cols= everything(),~cov(.x,z,use= "pairwise.complete.obs"))))%>%
slice(n=1)%>%
tidyr::pivot_longer( cols = tidyselect::everything(), names_to = "FACT", values_to = "CoV")
# # A tibble: 2 x 2
#   FACT      CoV
#   <chr>   <dbl>
# 1 A      0.0705
# 2 B     -0.214
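To confirm the mismatch in the original data before calling cov(), one option is to compare each group's size with length(z). This is only a sketch: it rebuilds the unequal-length data from the question with an arbitrary seed, and the names sizes, n_obs and matches_z are my own, not from the question.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only for reproducibility

# Rebuild the question's data: 252 rows for A, 190 rows for B
DATA <- tibble::tibble(
  FACT = factor(rep(c("A", "B"), times = c(252, 190))),
  d    = rnorm(442)
)
z <- rnorm(100)

# Count rows per group and flag groups whose size differs from length(z)
sizes <- DATA %>%
  count(FACT, name = "n_obs") %>%
  mutate(matches_z = n_obs == length(z))

sizes
```

Any group with matches_z equal to FALSE will trigger the "incompatible dimensions" error when passed to cov() together with z.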

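Edit:

The OP commented:

"The problem is that pairwise.complete.obs does not solve the mismatch in vector lengths, as I wanted it to."

"pairwise.complete.obs" is for dropping the rows where either vector is NA. The input vectors must still have equal length. For example: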

# returns NA due to missing values
cov(
c(1,2,3,NA,5,6),
c(6,NA,2,NA,5,1)
)
# NA
# with pairwise.complete.obs, returns covariance for pairs without NAs
cov(
c(1,2,3,NA,5,6),
c(6,NA,2,NA,5,1),
use = "pairwise.complete.obs"
)
# -3.166667
# but still throws an error for unequal dimensions
cov(
c(1,2,3,NA,5,6,7,8),
c(6,NA,2,NA,5,1),
use = "pairwise.complete.obs"
)
# Error in cov(c(1, 2, 3, NA, 5, 6, 7, 8), c(6, NA, 2, NA, 5, 1), use = "pairwise.complete.obs") : 
#   incompatible dimensions

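The underlying issue is that covariance is computed from pairs of values. Put differently, your input vectors need to have the same length so that R knows which values you want "paired". So trying to get the covariance of vectors of different lengths does not make sense.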
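To make the "pairs" point concrete, here is a small sketch that computes the sample covariance by hand for the four complete pairs from the example above (positions 1, 3, 5 and 6) and compares it with cov(). The variable name manual is mine.

```r
# The complete (non-NA) pairs from the earlier example
x <- c(1, 3, 5, 6)
y <- c(6, 2, 5, 1)

# Sample covariance: sum over pairs of (x_i - mean(x)) * (y_i - mean(y)),
# divided by n - 1. Each term needs one x value and one y value.
manual <- sum((x - mean(x)) * (y - mean(y))) / (length(x) - 1)

manual
# -3.166667
cov(x, y)
# -3.166667
```

Every term in the sum uses the i-th element of both vectors, so an element of x with no partner in y has nothing to be multiplied with.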

Postscript: the code can be simplified considerably with dplyr::summarize (again assuming every group has the same length as z):

DATA %>%
group_by(FACT) %>%
summarize(CoV = cov(d, z, use= "pairwise.complete.obs"))
# # A tibble: 2 x 2
#   FACT      CoV
#   <fct>   <dbl>
# 1 A      0.0705
# 2 B     -0.214
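If the groups really do have different lengths, as in the question, you first have to decide which observations pair with z. The sketch below assumes (and this is purely my assumption, not something stated in the question) that the i-th observation of each group corresponds to z[i]. Under that assumption each group is cut or NA-padded to length(z), and "pairwise.complete.obs" then drops the padded pairs. The names n_common and result are mine.

```r
library(dplyr)

set.seed(42)  # arbitrary seed, only for reproducibility

# Unequal-length groups, as in the question
DATA <- tibble::tibble(
  FACT = factor(rep(c("A", "B"), times = c(252, 190))),
  d    = rnorm(442)
)
z <- rnorm(100)
n_common <- length(z)

result <- DATA %>%
  group_by(FACT) %>%
  summarize(
    # d[seq_len(n_common)] keeps the first n_common values of the group;
    # a group shorter than n_common is padded with NA, and those pairs
    # are dropped by use = "pairwise.complete.obs"
    CoV = cov(d[seq_len(n_common)], z, use = "pairwise.complete.obs")
  )

result
```

This scales to 150 groups without changes, but whether the result means anything depends entirely on the pairing assumption. For independently simulated data like this, the choice of the first 100 rows is arbitrary.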
