r语言 - 使用 dplyr 获取方差为零的列名



我试图在我的数据中找到任何方差为零的变量(即常量连续变量(。 我想出了如何使用 lapply 做到这一点,但我想使用 dplyr,因为我试图遵循整洁的数据原则。 我可以使用 dplyr 创建仅方差的向量,但这是最后一步,我发现值不等于零并返回让我感到困惑的变量名称。

这是代码

library(PReMiuM)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
#> ✔ tibble  1.4.2     ✔ dplyr   0.7.4
#> ✔ tidyr   0.7.2     ✔ stringr 1.2.0
#> ✔ readr   1.2.0     ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()

setwd("~/Stapleton_Lab/Projects/Premium/hybridAnalysis/")
# read in data from analysis script
df <- read_csv("./hybrid.csv")
#> Parsed with column specification:
#> cols(
#>   .default = col_double(),
#>   Exp = col_character(),
#>   Pedi = col_character(),
#>   Harvest = col_character()
#> )
#> See spec(...) for full column specifications.
# checking for missing variable
# df %>% 
#     select_if(function(x) any(is.na(x))) %>% 
    # summarise_all(funs(sum(is.na(.))))

# grab month for analysis
may <- df %>% 
    filter(Month==5)
june <- df %>% 
    filter(Month==6)
july <- df %>% 
    filter(Month==7)
aug <- df %>% 
    filter(Month==8)
sept <- df %>% 
    filter(Month==9)
oct <- df %>% 
    filter(Month==10)
# check for zero variance in continuous covariates
numericVars <- grep("Min|Max",names(june))
zero <- which(lapply(june[numericVars],var)==0,useNames = TRUE)
noVar <- june %>% 
    select(numericVars) %>% 
    summarise_all(var) %>% 
    filter_if(all, all_vars(. != 0))
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical

通过一个可重现的示例,我认为您的目标如下。请注意,正如 Colin 所指出的,我没有处理过使用字符变量选择变量的问题。 有关详细信息,请参阅他的回答。

# reproducible data
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7
library(dplyr)
mtcars2 %>% 
  summarise_all(var) %>% 
  select_if(function(.) . == 0) %>% 
  names()
# [1] "mpg"  "qsec"

就个人而言,我认为这会混淆您在做什么。 使用purrr包的以下之一(如果您希望保持整洁(将是我的偏好,并附有很好的评论。

library(purrr)
# Return a character vector of variable names which have 0 variance
names(mtcars2)[which(map_dbl(mtcars2, var) == 0)]
names(mtcars2)[map_lgl(mtcars2, function(x) var(x) == 0)]

如果你想优化它的速度,坚持使用基本R。

# Return a character vector of variable names which have 0 variance
names(mtcars2)[vapply(mtcars2, function(x) var(x) == 0, logical(1))]

你有两个问题。

1. 将列名作为变量传递给select()

关于这一点的小插曲在这里。 使用 DPLYR 编程。 此处的解决方案是使用 select 函数的select_at()作用域变体。

2. 方差等于 0

noVar <- june %>% 
    select_at(.vars=numericVars) %>% 
    summarise_all(.funs=var) %>%
    filter_all(any_vars(. == 0))

如果唯一计数为 1,则选择列,然后使用 @Benjamin 的示例数据 mtcars2 获取列名:

mtcars2 %>% 
  select_if(function(.) n_distinct(.) == 1) %>% 
  names()
# [1] "mpg"  "qsec"

这里的答案都很好,但是由于 DPLYR 1.0.0 弃用了作用域变体(例如 select_if、select_at、filter_all(,以下是使用 @Benjamin 给出的相同 repex 数据的更新:

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7
mtcars2 %>% 
  map_df( ~ var(.)) %>% 
  select(where( ~ . == 0))

# A tibble: 1 x 2
    mpg  qsec
  <dbl> <dbl>
1     0     0

%>% names后:

[1] "mpg"  "qsec"

最新更新