根据特定于每个列的多个列和阈值进行筛选(在 R 中)

我有一个包含~600列的数据帧。我想根据变量对数据帧进行分组，并根据特定于每个列和组的阈值过滤这些"感兴趣的列"中的n(通常占列总数的一小部分(。

我已经开始使用 dplyr 来做这件事了。我将使用iris数据集(因为我不是原创的(来演示：

library(tidyverse)
iris %>%
group_by(Species) %>%
mutate_at(vars(starts_with("Petal")), 
funs(threshold = quantile(., 0.5) - IQR(.)))

这将计算我的阈值(每组和每列(，并将它们放在名为Petal.Length_threshold和Petal.Width_threshold的新列中。

# A tibble: 150 x 7
# Groups:   Species [3]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Length_th…
<dbl>       <dbl>        <dbl>       <dbl> <fct>              <dbl>
1          5.1         3.5          1.4         0.2 setosa              1.32
2          4.9         3            1.4         0.2 setosa              1.32
3          4.7         3.2          1.3         0.2 setosa              1.32
4          4.6         3.1          1.5         0.2 setosa              1.32
5          5           3.6          1.4         0.2 setosa              1.32
6          5.4         3.9          1.7         0.4 setosa              1.32
7          4.6         3.4          1.4         0.3 setosa              1.32
8          5           3.4          1.5         0.2 setosa              1.32
9          4.4         2.9          1.4         0.2 setosa              1.32
10          4.9         3.1          1.5         0.1 setosa              1.32

我现在要做的是检查每一行，所有感兴趣的列都大于它们各自的(组和列(阈值。我是这样做的：

columns <- colnames(
iris %>% 
select(starts_with("Petal"))
)
threshold_cols <- paste(columns, "threshold", sep = "_")
filtered_iris <- iris %>%
group_by(Species) %>%
mutate_at(vars(starts_with("Petal")), 
funs(threshold = quantile(., 0.5) - IQR(.))) %>%
filter(UQ(as.name(columns[1])) > UQ(as.name(threshold_cols[1])) &
UQ(as.name(columns[2])) > UQ(as.name(threshold_cols[2])))

(请注意，UQ(as.name())是由于烦人的 dplyr 非标准计算使得很难将列名作为变量输入到 dplyr 函数中(。

问题是我想概括这一点(因为我想编写一个可重用的函数(，以便它能够将任意数量的"感兴趣的列"与其各自的(组和列(阈值进行比较。我能够使用starts_with()找出每种情况下我感兴趣的列数，这将是上述代码中columns的长度。

此外，写出UQ(as.name(columns[1])) > UQ(as.name(threshold_cols[1]))是丑陋和耗时的，因此有关如何改进这一点的任何建议也将不胜感激。

我通过编写自己的函数来添加到 dplyr 管道的末尾来尝试这一点。该功能变得非常笨拙且难以阅读，但这里是：

columns <- colnames(
iris %>% 
select(starts_with("Petal"))
)
threshold_fun <- function(x){
# obtain only columns of interest
reduced_x <- x[,columns]
# create empty threshold vector
threshold <- vector(mode = "numeric",
length = length(columns))
# fill vector with the threshold
# result should be a vector of 2 (in this case) with the
# Petal.Length threshold then the Petal.Width threshold
for (i in 1:length(columns)){
print(i)
threshold[i] <- quantile(reduced_x[,i], 0.5) + IQR(reduced_x[,i])
}
# for each row check that all elements are greater than 
# threshold. Result should be vector of TRUEs and FALSEs
filter_rows <- apply(reduced_x, 1, function(a)
sum(a > threshold) == length(columns))
# subset using vector above
filtered_x <- x[filter_rows,]
return(filtered_x)
}
my_filter <- iris %>%
group_by(Species) %>%
threshold_fun()

这给了我一个错误Error: Can't use matrix or array for column indexing.我尝试添加print()语句来尝试找出问题在此函数中发生的位置，并且它似乎在 for 循环中。仅此一项就给出了上述错误：quantile(reduced_x[,i], 0.5).

我的问题是，如何概括第一个 dplyr 代码或修复我的函数？

编辑

Calum You的回答很好，但如果将来对任何想知道这个问题的人有用，我已经设法使我的函数工作：

columns <- colnames(
iris %>% 
select(starts_with("Petal"))
)
threshold_fun <- function(x){
# obtain only columns of interest
reduced_x <- x[,columns]
# create empty threshold vector
threshold <- vector(mode = "numeric",
length = length(columns))
for (i in 1:length(columns)){
threshold[i] <- quantile(reduced_x[,i][[1]], 0.5) - IQR(reduced_x[,i][[1]])
}
# for each row check that all elements are greater than threshold. 
# Result should be vector of TRUEs and FALSEs
filter_rows <- apply(reduced_x, 1, function(a){
sum(a > threshold) == length(columns)}
)
# subset using vector above
filtered_x <- x[filter_rows,]
# 
return(filtered_x)
}
myiris <- iris %>%
group_by(Species) %>%
do(threshold_fun(.))

reduced_x[,i]返回数据帧，而reduced_x[,i][[1]]返回矢量。mean和quantile等函数需要向量
添加do()(do(threshold_fun(.))(可确保组得到尊重，并且该功能在数据帧的组而不是整个数据帧上执行(如直接管道连接到threshold_fun()(。更多信息在这里

(是的，现在这是一个荒谬的长问题(

对于很多像这样的任务，我发现自己使用mutate_at，通过gather数据并使用不同的分组来实现我想做的事情，通常更容易概括。下面是一个接受字符starts_with参数作为"感兴趣的列"选择器的示例，因为您说"我能够使用 starts_with(( 找出每种情况下我有多少感兴趣的列"。

基本上，我们可以将所有感兴趣的列放入colname和value.这使每个新行都成为旧行-列组合。然后，我们可以通过将colname包含在分组中来计算阈值，而无需诉诸mutate_at。为了filter结果，我们分组rowid而不是 colname 并使用all(因此对于每个原始行，如果感兴趣的列中的任何值超过其各自的阈值，则该行中的所有值都将被删除(。最后，我们可以spread并清理我们创建的临时变量。

带有"Sepal"参数的示例，以及带有mtcars的示例。

library(tidyverse)
filter_threshold <- function(df, group_col, starts_with){
group_col <- enquo(group_col)
df %>%
rowid_to_column() %>%
gather(colname, value, starts_with(starts_with)) %>%
group_by(!!group_col, colname) %>%
mutate(threshold = quantile(value, 0.5) - IQR(value)) %>%
group_by(rowid, !!group_col) %>%
filter(all(value > threshold)) %>%
ungroup() %>%
select(-threshold) %>%
spread(colname, value) %>%
select(-rowid)
}
iris %>% filter_threshold(Species, "Petal")
#> # A tibble: 122 x 5
#>    Sepal.Length Sepal.Width Species Petal.Length Petal.Width
#>           <dbl>       <dbl> <fct>          <dbl>       <dbl>
#>  1          5.1         3.5 setosa           1.4         0.2
#>  2          4.9         3   setosa           1.4         0.2
#>  3          4.6         3.1 setosa           1.5         0.2
#>  4          5           3.6 setosa           1.4         0.2
#>  5          5.4         3.9 setosa           1.7         0.4
#>  6          4.6         3.4 setosa           1.4         0.3
#>  7          5           3.4 setosa           1.5         0.2
#>  8          4.4         2.9 setosa           1.4         0.2
#>  9          5.4         3.7 setosa           1.5         0.2
#> 10          4.8         3.4 setosa           1.6         0.2
#> # ... with 112 more rows
iris %>% filter_threshold(Species, "Sepal")
#> # A tibble: 121 x 5
#>    Petal.Length Petal.Width Species Sepal.Length Sepal.Width
#>           <dbl>       <dbl> <fct>          <dbl>       <dbl>
#>  1          1.4         0.2 setosa           5.1         3.5
#>  2          1.4         0.2 setosa           4.9         3  
#>  3          1.3         0.2 setosa           4.7         3.2
#>  4          1.4         0.2 setosa           5           3.6
#>  5          1.7         0.4 setosa           5.4         3.9
#>  6          1.5         0.2 setosa           5           3.4
#>  7          1.5         0.1 setosa           4.9         3.1
#>  8          1.5         0.2 setosa           5.4         3.7
#>  9          1.6         0.2 setosa           4.8         3.4
#> 10          1.4         0.1 setosa           4.8         3  
#> # ... with 111 more rows
mtcars %>% filter_threshold(cyl, "d")
#> # A tibble: 26 x 11
#>      mpg   cyl    hp    wt  qsec    vs    am  gear  carb  disp  drat
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6   110  2.62  16.5     0     1     4     4  160   3.9 
#>  2  21       6   110  2.88  17.0     0     1     4     4  160   3.9 
#>  3  22.8     4    93  2.32  18.6     1     1     4     1  108   3.85
#>  4  18.7     8   175  3.44  17.0     0     0     3     2  360   3.15
#>  5  14.3     8   245  3.57  15.8     0     0     3     4  360   3.21
#>  6  22.8     4    95  3.15  22.9     1     0     4     2  141.  3.92
#>  7  19.2     6   123  3.44  18.3     1     0     4     4  168.  3.92
#>  8  17.8     6   123  3.44  18.9     1     0     4     4  168.  3.92
#>  9  16.4     8   180  4.07  17.4     0     0     3     3  276.  3.07
#> 10  17.3     8   180  3.73  17.6     0     0     3     3  276.  3.07
#> # ... with 16 more rows

创建于 2018-10-05 由 reprex 软件包 (v0.2.0(.

相关内容

最新更新

热门标签：