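I am trying to find any variables in my data that have zero variance (i.e. constant continuous variables). I figured out how to do it with lapply, but I would like to use dplyr because I am trying to follow tidy data principles. I can create a vector of just the variances with dplyr, but it is the last step, where I find the values not equal to zero and return the variable names, that is confusing me.
Here is the code: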
library(PReMiuM)
library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
#> ✔ tibble 1.4.2 ✔ dplyr 0.7.4
#> ✔ tidyr 0.7.2 ✔ stringr 1.2.0
#> ✔ readr 1.2.0 ✔ forcats 0.2.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
setwd("~/Stapleton_Lab/Projects/Premium/hybridAnalysis/")
# read in data from analysis script
df <- read_csv("./hybrid.csv")
#> Parsed with column specification:
#> cols(
#> .default = col_double(),
#> Exp = col_character(),
#> Pedi = col_character(),
#> Harvest = col_character()
#> )
#> See spec(...) for full column specifications.
# checking for missing variable
# df %>%
# select_if(function(x) any(is.na(x))) %>%
# summarise_all(funs(sum(is.na(.))))
# grab month for analysis
may <- df %>%
  filter(Month == 5)
june <- df %>%
  filter(Month == 6)
july <- df %>%
  filter(Month == 7)
aug <- df %>%
  filter(Month == 8)
sept <- df %>%
  filter(Month == 9)
oct <- df %>%
  filter(Month == 10)
# check for zero variance in continuous covariates
numericVars <- grep("Min|Max",names(june))
zero <- which(lapply(june[numericVars],var)==0,useNames = TRUE)
noVar <- june %>%
  select(numericVars) %>%
  summarise_all(var) %>%
  filter_if(all, all_vars(. != 0))
#> Warning in .p(.tbl[[vars[[i]]]], ...): coercing argument of type 'double'
#> to logical
#> (the same warning is repeated 19 more times)
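With a reproducible example, I think what you are aiming for is below. Note that, as Colin pointed out, I have not dealt with the issue of selecting variables with a character variable. See his answer for details on that.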
# reproducible data
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7
library(dplyr)
mtcars2 %>%
  summarise_all(var) %>%
  select_if(function(.) . == 0) %>%
  names()
# [1] "mpg" "qsec"
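Personally, I think that obfuscates what you are doing. One of the following, using the purrr package (if you wish to remain in the tidyverse), would be my preference, together with a well-written comment.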
library(purrr)
# Return a character vector of variable names which have 0 variance
names(mtcars2)[which(map_dbl(mtcars2, var) == 0)]
names(mtcars2)[map_lgl(mtcars2, function(x) var(x) == 0)]
If you want to optimize it for speed, stick with base R.
# Return a character vector of variable names which have 0 variance
names(mtcars2)[vapply(mtcars2, function(x) var(x) == 0, logical(1))]
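The base R one-liner above assumes every column is numeric. The data in the question also has character columns, so here is a minimal sketch that first restricts the check to numeric columns. The helper name `zero_var_names` and the extra `label` column are made up for illustration; they do not come from the question or the answers.

```r
# Hypothetical helper: names of numeric columns whose variance is exactly 0
zero_var_names <- function(data) {
  # Only numeric columns are candidates
  is_num <- vapply(data, is.numeric, logical(1))
  # isTRUE() turns an NA variance (e.g. an all-NA column) into FALSE
  is_const <- vapply(
    data[is_num],
    function(x) isTRUE(var(x, na.rm = TRUE) == 0),
    logical(1)
  )
  names(data)[is_num][is_const]
}

# Example data: two constant numeric columns plus one character column
mtcars3 <- mtcars
mtcars3$mpg <- mtcars3$qsec <- 7
mtcars3$label <- "a"

zero_var_names(mtcars3)
# [1] "mpg"  "qsec"
```

The constant character column `label` is deliberately not reported, because the check is limited to numeric columns.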
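You have two problems.
1. Passing column names as a variable to select()
The vignette about this is here: Programming with dplyr. The solution is to use select_at(), the scoped variant of the select function.
2. Variance equal to 0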
noVar <- june %>%
  select_at(.vars = numericVars) %>%
  summarise_all(.funs = var) %>%
  filter_all(any_vars(. == 0))
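Note that filter_all() keeps or drops rows, so this pipeline returns the one-row table of variances rather than the variable names. If the names are what you are after, one possible sketch is to reshape the summary to long format and filter on the variance. It uses tidyr and the `mtcars2` example data from the other answer instead of the question's data; the object name `constant_cols` and the column names `variable` and `variance` are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Example data with two constant columns
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

constant_cols <- mtcars2 %>%
  # One row holding the variance of every column
  summarise(across(everything(), var)) %>%
  # One row per variable: columns `variable` and `variance`
  pivot_longer(everything(), names_to = "variable", values_to = "variance") %>%
  # Keep the variables whose variance is exactly 0
  filter(variance == 0) %>%
  pull(variable)

constant_cols
# [1] "mpg"  "qsec"
```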
Select the columns whose number of distinct values is 1, then get the column names, using @Benjamin's example data mtcars2:
mtcars2 %>%
  select_if(function(.) n_distinct(.) == 1) %>%
  names()
# [1] "mpg" "qsec"
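One difference from the variance-based approaches is worth keeping in mind: n_distinct() counts NA as a distinct value unless na.rm = TRUE is set, and it also works on non-numeric columns. A small sketch with a made-up tibble (`toy` and its columns are illustrative only):

```r
library(dplyr)

x <- c(7, 7, NA)
n_distinct(x)                # 2: NA counts as its own value
n_distinct(x, na.rm = TRUE)  # 1

# Made-up data: `a` is constant apart from an NA, `c` is a constant character column
toy <- tibble(
  a = c(7, 7, NA),
  b = c(1, 2, 3),
  c = c("k", "k", "k")
)

constant_names <- toy %>%
  select(where(~ n_distinct(.x, na.rm = TRUE) == 1)) %>%
  names()

constant_names
# [1] "a" "c"
```

Whether a column like `a` should count as constant depends on how you want to treat missing values, so treat na.rm = TRUE as a choice rather than a default.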
The answers here are all good, but since dplyr 1.0.0 superseded the scoped variants (e.g. select_if, select_at, filter_all), here is an update using the same reprex data that @Benjamin gave:
library(dplyr)
library(purrr)
mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7
mtcars2 %>%
  map_df(~ var(.)) %>%
  select(where(~ . == 0))
gives
# A tibble: 1 x 2
mpg qsec
<dbl> <dbl>
1 0 0
or, after adding %>% names:
[1] "mpg" "qsec"
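If you would rather stay within dplyr alone, the same idea can be sketched with summarise(across(...)), which also lets you pick the columns by a name pattern, much as the question does with grep("Min|Max", ...). The pattern "mpg|qsec|hp" below is only a stand-in for the mtcars2 example, and `zero_var` is a name chosen for illustration.

```r
library(dplyr)

mtcars2 <- mtcars
mtcars2$mpg <- mtcars2$qsec <- 7

zero_var <- mtcars2 %>%
  # Variance of every column whose name matches the pattern
  summarise(across(matches("mpg|qsec|hp"), var)) %>%
  # Keep only the columns whose variance is exactly 0
  select(where(~ .x == 0)) %>%
  names()

zero_var
# [1] "mpg"  "qsec"
```

Here `hp` matches the pattern but is dropped in the last step because its variance is not zero.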