r - Subsetting and summarizing data inside group_by() produces incorrect results



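I thought I was fairly confident with dplyr and base R, but I have run into a problem I cannot explain. I am trying to group my data and then summarize a subset of it.

However, I am getting incorrect results, which is worrying because I have used this pattern a lot.

I am trying to sum all the Sepal.Length values within each Species group, but only those whose corresponding Sepal.Width is less than 3.5.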

library(tidyverse)

# Goal: sum Sepal.Length per Species, but only for rows where Sepal.Width < 3.5.

# This produces NA in the newly created column
iris %>% 
  group_by(Species) %>% 
  summarize(new_col1 = sum(Sepal.Length[iris$Sepal.Width < 3.5]))

# This produces a result, however if you focus on 'versicolor' you see a value of 166.3
iris %>% 
  group_by(Species) %>% 
  summarize(new_col1 = sum(Sepal.Length[iris$Sepal.Width < 3.5], na.rm = TRUE))

# However, if you verify this amount manually you see a different answer (296.6)
iris %>% 
  filter(Species == "versicolor",
         Sepal.Width < 3.5) %>% 
  pull(Sepal.Length) %>% 
  sum

Is there a way to subset data inside group_by() / summarize(), so that only the values meeting an additional filter criterion (as in my first example) are summed?

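This is a simple fix: drop the `iris$` inside summarise(). `iris$Sepal.Width` refers to the full, ungrouped iris (all 150 rows), so the logical index does not line up with the rows of each group: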

library(tidyverse)
iris %>% 
  group_by(Species) %>% 
  summarise(new_col1 = sum(Sepal.Length[Sepal.Width < 3.5]))
#> # A tibble: 3 x 2
#>   Species    new_col1
#>   <fct>         <dbl>
#> 1 setosa         135.
#> 2 versicolor     297.
#> 3 virginica      307.

As a check:


iris %>% 
  group_by(Species) %>% 
  filter(Sepal.Width < 3.5) %>%
  summarise(new_col1 = sum(Sepal.Length))
#> # A tibble: 3 x 2
#>   Species    new_col1
#>   <fct>         <dbl>
#> 1 setosa         135.
#> 2 versicolor     297.
#> 3 virginica      307.
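
To see why the `iris$` version goes wrong, here is a minimal base R sketch of what happens inside a single group. It assumes the default row order of the built-in iris data (setosa in rows 1 to 50, then versicolor, then virginica); the variable names are illustrative only and do not come from the question.

```r
# Base R only; iris ships with R's datasets package.

# What summarise() sees for one group: 50 values.
versicolor_len <- iris$Sepal.Length[iris$Species == "versicolor"]

# What iris$Sepal.Width < 3.5 evaluates to: 150 values, ignoring the grouping.
full_mask <- iris$Sepal.Width < 3.5

# Indexing a 50-element vector with a 150-element logical vector:
# every TRUE past position 50 returns NA.
mismatched <- versicolor_len[full_mask]
anyNA(mismatched)   # TRUE
sum(mismatched)     # NA, as in the first attempt

# With na.rm = TRUE the NAs vanish, so only the first 50 mask entries matter.
# Under the assumed row order those belong to the setosa rows, not versicolor.
wrong_sum <- sum(mismatched, na.rm = TRUE)
same_as   <- sum(versicolor_len[full_mask[1:50]])
isTRUE(all.equal(wrong_sum, same_as))   # TRUE

# A mask built from the group's own rows has the right length and alignment.
group_mask <- iris$Sepal.Width[iris$Species == "versicolor"] < 3.5
right_sum  <- sum(versicolor_len[group_mask])
```

So the second attempt returns a number only because `na.rm = TRUE` hides the length mismatch. The value it sums is selected by another group's Sepal.Width, which is why it disagrees with the manual check.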
