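Given a tibble listing users, products, and product features, I am trying to calculate the fraction of each product's distinct users who have a particular product feature: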
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
df <- tribble(
  ~users, ~product, ~feature,
  "bob", "iPhone", "screen",
  "bob", "iPhone", "camera",
  "bob", "iPhone", "facial recognition",
  "sally", "Android", "screen",
  "sally", "Android", "camera",
  "sally", "Android", "facial recognition",
  "joe", "Huawei", "screen",
  "joe", "Huawei", "camera",
  "joe", "Huawei", "facial recognition",
  "rachel", "iPhone", "screen",
  "rachel", "iPhone", "camera",
  "rachel", "iPhone", "fingerprint sensor"
)
# Get count of distinct users by product
df <- df %>%
  group_by(product) %>%
  mutate(n_users = n_distinct(users)) %>%
  ungroup()
df
#> # A tibble: 12 x 4
#>    users  product feature            n_users
#>    <chr>  <chr>   <chr>                <int>
#>  1 bob    iPhone  screen                   2
#>  2 bob    iPhone  camera                   2
#>  3 bob    iPhone  facial recognition       2
#>  4 sally  Android screen                   1
#>  5 sally  Android camera                   1
#>  6 sally  Android facial recognition       1
#>  7 joe    Huawei  screen                   1
#>  8 joe    Huawei  camera                   1
#>  9 joe    Huawei  facial recognition       1
#> 10 rachel iPhone  screen                   2
#> 11 rachel iPhone  camera                   2
#> 12 rachel iPhone  fingerprint sensor       2
# Count the fraction of distinct users with given product feature
df <- df %>%
  group_by(product, feature) %>%
  summarise(feature_fraction = n()/n_users,
            .groups = "drop_last")
df
#> # A tibble: 12 x 3
#> # Groups:   product [3]
#>    product feature            feature_fraction
#>    <chr>   <chr>                         <dbl>
#>  1 Android camera                          1
#>  2 Android facial recognition              1
#>  3 Android screen                          1
#>  4 Huawei  camera                          1
#>  5 Huawei  facial recognition              1
#>  6 Huawei  screen                          1
#>  7 iPhone  camera                          1
#>  8 iPhone  camera                          1
#>  9 iPhone  facial recognition              0.5
#> 10 iPhone  fingerprint sensor              0.5
#> 11 iPhone  screen                          1
#> 12 iPhone  screen                          1
Created on 2020-10-23 by the reprex package (v0.3.0)
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.0.2 (2020-06-22)
#> os Windows 10 x64
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_United States.1252
#> ctype English_United States.1252
#> tz America/New_York
#> date 2020-10-23
#>
#> - Packages -------------------------------------------------------------------
#> package * version date lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> backports 1.1.10 2020-09-15 [1] CRAN (R 4.0.2)
#> callr 3.4.4 2020-09-07 [1] CRAN (R 4.0.2)
#> cli 2.0.2 2020-02-28 [1] CRAN (R 4.0.2)
#> crayon 1.3.4 2017-09-16 [1] CRAN (R 4.0.2)
#> desc 1.2.0 2018-05-01 [1] CRAN (R 4.0.2)
#> devtools 2.3.1 2020-07-21 [1] CRAN (R 4.0.2)
#> digest 0.6.25 2020-02-23 [1] CRAN (R 4.0.2)
#> dplyr * 1.0.2 2020-08-18 [1] CRAN (R 4.0.2)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fansi 0.4.1 2020-01-08 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.0.2 2018-11-29 [1] CRAN (R 4.0.2)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.0 2020-06-16 [1] CRAN (R 4.0.2)
#> knitr 1.29 2020-06-23 [1] CRAN (R 4.0.2)
#> lifecycle 0.2.0 2020-03-06 [1] CRAN (R 4.0.2)
#> magrittr 1.5 2014-11-22 [1] CRAN (R 4.0.2)
#> memoise 1.1.0 2017-04-21 [1] CRAN (R 4.0.2)
#> pillar 1.4.6 2020-07-10 [1] CRAN (R 4.0.2)
#> pkgbuild 1.1.0 2020-07-13 [1] CRAN (R 4.0.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> pkgload 1.1.0 2020-05-29 [1] CRAN (R 4.0.2)
#> prettyunits 1.1.1 2020-01-24 [1] CRAN (R 4.0.2)
#> processx 3.4.4 2020-09-03 [1] CRAN (R 4.0.2)
#> ps 1.3.4 2020-08-11 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.4.1 2019-11-12 [1] CRAN (R 4.0.2)
#> remotes 2.2.0 2020-07-21 [1] CRAN (R 4.0.2)
#> rlang 0.4.7 2020-07-09 [1] CRAN (R 4.0.2)
#> rmarkdown 2.3 2020-06-18 [1] CRAN (R 4.0.2)
#> rprojroot 1.3-2 2018-01-03 [1] CRAN (R 4.0.2)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> testthat 2.3.2 2020-03-02 [1] CRAN (R 4.0.2)
#> tibble 3.0.3 2020-07-10 [1] CRAN (R 4.0.2)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> usethis 1.6.1 2020-04-29 [1] CRAN (R 4.0.2)
#> utf8 1.1.4 2018-05-24 [1] CRAN (R 4.0.2)
#> vctrs 0.3.4 2020-08-29 [1] CRAN (R 4.0.2)
#> withr 2.3.0 2020-09-22 [1] CRAN (R 4.0.2)
#> xfun 0.16 2020-07-24 [1] CRAN (R 4.0.2)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
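As you can see, the final tibble has multiple rows for (product, feature) group keys that share the same summarised value. As far as I know, this is unexpected behavior for summarise, and the result looks almost identical to what mutate would return. Given this open GitHub issue, it seems the new version of summarise has not had all of its kinks worked out yet. I could also just be missing something obvious; if anyone can help get me back on track, I would be very grateful!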
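The problem is that each group contains multiple n_users values, so n()/n_users is a vector with one element per row in the group rather than a single number. Recent versions of dplyr allow summarise to return multiple rows per group when the summary expression returns multiple values.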
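One way to confirm this is to measure the length of the expression inside each group. The following is a minimal sketch, assuming only the iPhone rows from the question; the names iphone, sizes, and expr_length are made up for illustration.

```r
library(dplyr)

# Subset of the question's data: the two iPhone users only
iphone <- tribble(
  ~users, ~product, ~feature,
  "bob", "iPhone", "screen",
  "bob", "iPhone", "camera",
  "bob", "iPhone", "facial recognition",
  "rachel", "iPhone", "screen",
  "rachel", "iPhone", "camera",
  "rachel", "iPhone", "fingerprint sensor"
) %>%
  mutate(n_users = n_distinct(users))

# length() always returns a single value, so this yields one row per group
# and reports how many values n()/n_users would have produced in that group
sizes <- iphone %>%
  group_by(product, feature) %>%
  summarise(expr_length = length(n() / n_users), .groups = "drop")

sizes
```

Any group where expr_length is greater than 1 is a group for which the original summarise call emits more than one row.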
If you are willing to assume that all values of n_users are the same within each group, you can take the first one:
df %>%
  group_by(product, feature) %>%
  summarise(feature_fraction = n()/first(n_users),
            .groups = "drop_last")
This ensures that only one value is returned per group.
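For completeness, here is a self-contained sketch of the whole calculation starting from the question's raw data. It is one possible arrangement rather than the only one: it counts n_distinct(users) in the numerator instead of n(), which gives the same result here but also guards against duplicated rows, and the name result is made up for illustration.

```r
library(dplyr)

df <- tribble(
  ~users, ~product, ~feature,
  "bob", "iPhone", "screen",
  "bob", "iPhone", "camera",
  "bob", "iPhone", "facial recognition",
  "sally", "Android", "screen",
  "sally", "Android", "camera",
  "sally", "Android", "facial recognition",
  "joe", "Huawei", "screen",
  "joe", "Huawei", "camera",
  "joe", "Huawei", "facial recognition",
  "rachel", "iPhone", "screen",
  "rachel", "iPhone", "camera",
  "rachel", "iPhone", "fingerprint sensor"
)

result <- df %>%
  # Distinct users per product, repeated on every row of that product
  group_by(product) %>%
  mutate(n_users = n_distinct(users)) %>%
  # Regroup by product and feature (this replaces the previous grouping)
  group_by(product, feature) %>%
  # first() collapses the constant n_users column to a single value,
  # so each group contributes exactly one row
  summarise(feature_fraction = n_distinct(users) / first(n_users),
            .groups = "drop")

result
```

With this data the result has one row per (product, feature) pair, and the two features owned by only one of the two iPhone users come out at 0.5.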