R: "median()" returns an inconsistent class when the vector is labelled with Hmisc



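I have a column that has been labelled with the Hmisc R package. The class of the column is c("labelled", "numeric"). If I calculate median() of the entire column, the returned median keeps the class c("labelled", "numeric").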

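However, if I calculate median() within two subgroups, one median is returned with the same class, but the other is returned with class "numeric". The differing classes cause an error in dplyr::summarize().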

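  1. Can anyone help me understand why the class changes?
  2. What can I do to fix this? FYI, this code lives inside a package, and I would like to avoid special-casing variables that have been labelled with Hmisc.

library(magrittr)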
data <-
  structure(
    list(
      cd4_count = c(
        30, 97, 210, NA, 358, 242, 126,
        792, 6, 145, 22, 150, 43, 23, 39, 953, 357, 427, 367, 239, 72,
        61, 61, 438, 392, 1092, 245, 326, 42, 135, 199, 158, 17, NA,
        287, 187, 252, 477, 157, NA, NA, 362, NA, 183, 885, 109, 321,
        286, 142, 797
      ),
      unsuccessful = c(
        0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
        1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0
      )
    ),
    row.names = c(NA, 50L),
    class = "data.frame"
  )
# Add label to CD4 count, using Hmisc package
Hmisc::label(data$cd4_count) <- "CD4 count"
# the classes here are all the same
data$cd4_count %>% class()
#> [1] "labelled" "numeric"
data$cd4_count[data$unsuccessful == 0] %>% class()
#> [1] "labelled" "numeric"
data$cd4_count[data$unsuccessful == 1] %>% class()
#> [1] "labelled" "numeric"

# Why are the results not the same class?!?!
data$cd4_count[data$unsuccessful == 0] %>% median(na.rm = TRUE) %>% class()
#> [1] "labelled" "numeric"
data$cd4_count[data$unsuccessful == 1] %>% median(na.rm = TRUE) %>% class()
#> [1] "numeric"
# Because the classes are different, I cannot run this code
data %>%
  dplyr::group_by(unsuccessful) %>%
  dplyr::summarize_at(dplyr::vars(cd4_count), median, na.rm = TRUE)
#> Error: Problem with `summarise()` input `cd4_count`.
#> x Input `cd4_count` must return compatible vectors across groups
#> i Result type for group 1 (unsuccessful = 0): <labelled>.
#> i Result type for group 2 (unsuccessful = 1): <double>.
#> i Input `cd4_count` is `(function (x, na.rm = FALSE, ...) ...`.

Created on 2021-04-27 by the reprex package (v2.0.0)

As user20650 pointed out in the comments, whether the attributes are dropped or kept depends on the length of the vector x.

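Looking at the code of the median.default method shows why. If length(x) is even (counted after any NA values have been removed by na.rm = TRUE), median() calls mean() internally on the two middle values, and mean() drops attributes. If length(x) is odd, x is only sorted and subset, and unlike mean() those operations keep the attributes.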

# let's have a look at the median.default method
function (x, na.rm = FALSE, ...) 
{
    if (is.factor(x) || is.data.frame(x)) 
        stop("need numeric data")
    if (length(names(x))) 
        names(x) <- NULL
    if (na.rm) 
        x <- x[!is.na(x)]
    else if (any(is.na(x))) 
        return(x[FALSE][NA])
    n <- length(x)
    if (n == 0L) 
        return(x[FALSE][NA])
    half <- (n + 1L)%/%2L
    if (n%%2L == 1L) 
        # when length is odd: attribute is kept
        sort(x, partial = half)[half] 
    # when length is even: `mean` drops attribute
    else mean(sort(x, partial = half + 0L:1L)[half + 0L:1L]) 
}

Created on 2021-04-28 by the reprex package (v0.3.0)
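To see the parity rule in isolation, here is a minimal sketch on toy data. The vector v, the grouping g, and the label text are made up for illustration and are not from the question; the sketch assumes Hmisc is installed, as in the rest of this answer. Group "a" has three non-missing values and group "b" has four, so only "a" should go through the odd-length branch.

```r
# Toy data (hypothetical): group "a" ends up with 3 non-missing values,
# group "b" with 4
v <- c(5, 1, 3, NA, 8, 2, 6, 4)
g <- c("a", "a", "a", "a", "b", "b", "b", "b")
Hmisc::label(v) <- "toy measurement"

# Number of non-missing values per group, and whether that count is odd
n_obs <- tapply(!is.na(v), g, sum)
n_obs %% 2 == 1

# Odd count: sort + subset only, so the "labelled" class survives
med_a <- median(v[g == "a"], na.rm = TRUE)
inherits(med_a, "labelled")

# Even count: mean() of the two middle values, so the class is dropped
med_b <- median(v[g == "b"], na.rm = TRUE)
inherits(med_b, "labelled")
```

Running the same count on the question's data, tapply(!is.na(data$cd4_count), data$unsuccessful, sum), gives 33 non-missing values for unsuccessful = 0 and 12 for unsuccessful = 1, which matches the classes reported there.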

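Let's look at a few different vectors and how they behave. We can also define a keep_attr function that calls the wrapped function and then restores the attributes of the input on the result.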

x1 <- 1
Hmisc::label(x1) = "qw"
class(median(x1)) # keeps attribute
#> [1] "labelled" "numeric"
class(mean(x1))  # drops attribute
#> [1] "numeric"
x2 <- c(1, 2)
Hmisc::label(x2) = "qw"
class(median(x2)) # uses mean
#> [1] "numeric"
class(mean(x2))
#> [1] "numeric"
x3 <- c(1, 2, NA)
Hmisc::label(x3) = "qw"
class(median(x3)) # NA present and na.rm = FALSE: returns NA early, never calls mean
#> [1] "labelled" "numeric"
class(mean(x3))
#> [1] "numeric"
keep_attr <- function(.f, x, ...) {
  x_att <- attributes(x)
  res <- .f(x, ...)
  attributes(res) <- x_att
  res
}
class(keep_attr(median, x2))
#> [1] "labelled" "numeric"
class(keep_attr(mean, x2))
#> [1] "labelled" "numeric"
keep_attr(median, x3, na.rm = TRUE)
#> qw 
#> [1] 1.5

Created on 2021-04-28 by the reprex package (v0.3.0)
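One caveat with keep_attr as written: it copies every attribute of the input, including names, and assigning a names attribute longer than one element to a length-one result is an error. A possible variant, sketched here under the assumption that only non-structural attributes such as class and label need to survive, filters out names, dim and dimnames first. The name keep_attr_safe and the example vector x4 are hypothetical.

```r
keep_attr_safe <- function(.f, x, ...) {
  res <- .f(x, ...)
  att <- attributes(x)
  # names, dim and dimnames describe the input's length or shape,
  # so they generally do not fit a summary value
  att <- att[setdiff(names(att), c("names", "dim", "dimnames"))]
  attributes(res) <- att
  res
}

# Hypothetical named, labelled vector of even length
x4 <- c(a = 1, b = 2, c = 4, d = 8)
Hmisc::label(x4) <- "qw"

m <- keep_attr_safe(median, x4)
class(m)       # class restored from the input
as.numeric(m)  # underlying value: mean of the two middle values, 2 and 4
```

For unnamed columns inside a data frame, as in the dplyr example, both versions should behave the same; the filter only matters when the input carries names or dimensions.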

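Update: regarding your dplyr problem, I can now reproduce it (I initially forgot to label the cd4_count column and assumed it was a dplyr version issue). The keep_attr workaround does appear to work, though.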

library(dplyr)
data <-
  structure(
    list(
      cd4_count = c(
        30, 97, 210, NA, 358, 242, 126,
        792, 6, 145, 22, 150, 43, 23, 39, 953, 357, 427, 367, 239, 72,
        61, 61, 438, 392, 1092, 245, 326, 42, 135, 199, 158, 17, NA,
        287, 187, 252, 477, 157, NA, NA, 362, NA, 183, 885, 109, 321,
        286, 142, 797
      ),
      unsuccessful = c(
        0, 0, 0, 1, 0, 0, 1, 1, 0, 1,
        1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
        0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0
      )
    ),
    row.names = c(NA, 50L),
    class = "data.frame"
  )
# Add label to CD4 count, using Hmisc package
Hmisc::label(data$cd4_count) <- "CD4 count"
data %>%
  dplyr::group_by(unsuccessful) %>%
  dplyr::summarize_at(dplyr::vars(cd4_count), median, na.rm = TRUE)
#> Error: Problem with `summarise()` input `cd4_count`.
#> x Input `cd4_count` must return compatible vectors across groups
#> i Input `cd4_count` is `(function (x, na.rm = FALSE, ...) ...`.
#> i Result type for group 1 (unsuccessful = 0): <labelled>.
#> i Result type for group 2 (unsuccessful = 1): <double>.
data %>%
  dplyr::group_by(unsuccessful) %>%
  dplyr::summarize_at(dplyr::vars(cd4_count), ~ keep_attr(median, .x, na.rm = TRUE))
#> # A tibble: 2 x 2
#>   unsuccessful cd4_count 
#>          <dbl> <labelled>
#> 1            0 210.0     
#> 2            1 135.5

Created on 2021-04-28 by the reprex package (v0.3.0)
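If the label does not need to survive the summary, another option that involves no Hmisc-specific code is to strip the attributes before calling median(), so that every group returns a plain double. This is a sketch of that alternative rather than part of the workaround above; it assumes dplyr 1.0.0 or later for across(), and the toy data frame is made up for illustration.

```r
library(dplyr)

# Toy data (hypothetical): group 0 has an odd number of values,
# group 1 an even number
toy <- data.frame(
  g = c(0, 0, 0, 1, 1, 1, 1),
  v = c(5, 1, 3, 8, 2, 6, 4)
)
Hmisc::label(toy$v) <- "toy measurement"

# as.numeric() drops the class and label before median() sees the data,
# so both groups return the same type
res <- toy %>%
  group_by(g) %>%
  summarise(across(v, ~ median(as.numeric(.x), na.rm = TRUE)))

class(res$v)
```

The trade-off is that the summarised column no longer carries the label, so keep_attr remains the better fit when the label has to be preserved.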
