r - Count unique character values in a grouped data frame: dplyr::count(), stringr::str_count() and/or purrr



Error when using dplyr::count() inside purrr::map()

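I have a data frame in which I want to count the unique character values by row subset. The full dataset has more than 1000 rows and many tumour types.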

Toy example:

library(tidyverse)
df <- tibble::tribble(
~tumour, ~impact.on.surgery, ~impact.on.radiotherapy, ~impact.on.chemotherapy, ~impact.on.biologics, ~impact.on.immunotherapy,
'Breast', NA,               NA,               NA,               'Interrupted',      NA,
'Breast', NA,               NA,               NA,               'As.planned',       NA,
'Breast', NA,               NA,               NA,               'Interrupted',      NA,
'Breast', NA,               NA,               'As.planned',     NA,                NA,
'Breast', NA,               NA,               NA,               NA,               NA,
'Breast', NA,               NA,               NA,               'Interrupted',      NA
)
> df
# A tibble: 6 x 6
tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<chr>  <lgl>             <lgl>                  <chr>                  <chr>               <lgl>                  
1 Breast NA                NA                     NA                     Interrupted         NA                     
2 Breast NA                NA                     NA                     As.planned          NA                     
3 Breast NA                NA                     NA                     Interrupted         NA                     
4 Breast NA                NA                     As.planned             NA                  NA                     
5 Breast NA                NA                     NA                     NA                  NA                     
6 Breast NA                NA                     NA                     Interrupted         NA                     

Desired output: ideally a list of data frames named by tumour type, so that I can later reduce(bind_rows, .id = 'tumour') to add the .id column label

$ Breast
# A tibble: 2 x 6
impact      impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<chr>                   <dbl>                  <dbl>                  <dbl>               <dbl>                   <dbl>
1 Interrupted                 0                      0                      0                   3                       0
2 As.planned                  0                      0                      1                   1                       0

Tried so far:

# Gets single row tibble, but not sure how to `.id` label each row, map across all values & bind
df %>%   
summarise(across(starts_with('impact'), ~sum(str_count(.x, 'As.planned'), na.rm = T)))
# A tibble: 1 x 5
impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<int>                  <int>                  <int>               <int>                   <int>
1                 0                      0                      1                   1                       0
# ?Counts all variable values (no need to specify), simpler code, but also counts `NAs` and I can't pivot that to a wide form as it has 'counted' the tumour
df %>% 
map_dfr(~count(data.frame(x=.), x), .id = 'var')
var           x n
1                  tumour      Breast 6
2       impact.on.surgery        <NA> 6
3  impact.on.radiotherapy        <NA> 6
4  impact.on.chemotherapy  As.planned 1
5  impact.on.chemotherapy        <NA> 5
6     impact.on.biologics  As.planned 1
7     impact.on.biologics Interrupted 3
8     impact.on.biologics        <NA> 2
9 impact.on.immunotherapy        <NA> 6

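One option with map is to loop over the values to be counted, i.e. 'Interrupted' and 'As.planned'. For each value, group by 'tumour', then use summarise with across on the columns whose names start with the prefix 'impact', and take the sum of the logical vector in each column to get the frequency count.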

library(dplyr)
library(purrr)
library(stringr)
map_dfr(dplyr::lst('Interrupted', 'As.planned'), ~
df %>%
group_by(tumour) %>% 
summarise(across(starts_with('impact'), function(x)
sum( x == .x, na.rm = TRUE)), .groups = 'drop'), .id = 'impact') %>%
mutate(impact = str_remove_all(impact, '"'))
# A tibble: 2 x 7
#  impact      tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
#  <chr>       <chr>              <int>                  <int>                  <int>               <int>                   <int>
#1 Interrupted Breast                 0                      0                      0                   3                       0
#2 As.planned  Breast                 0                      0                      1                   1                       0

Or, to avoid the quotes around the values, use setNames instead of lst

map_dfr(setNames(c('Interrupted', 'As.planned'),
c('Interrupted', 'As.planned')),  ~
df %>%
group_by(tumour) %>% 
summarise(across(starts_with('impact'), function(x)
sum( x == .x, na.rm = TRUE)), .groups = 'drop'), .id = 'impact')
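
If the named list of data frames from the question is what you are after, one possible sketch is to split by tumour first and apply the same counting step to each piece. The data frame df2, its 'Lung' rows, and the helper count_values are made up here purely for illustration; they are not part of the original data.

```r
library(dplyr)
library(purrr)

# Hypothetical two-tumour data, smaller than the toy example (assumption)
df2 <- tibble::tribble(
  ~tumour,  ~impact.on.chemotherapy, ~impact.on.biologics,
  "Breast", NA,                      "Interrupted",
  "Breast", "As.planned",            NA,
  "Lung",   "Interrupted",           "Interrupted"
)

values <- c("Interrupted", "As.planned")

# Hypothetical helper: one row per value, one count column per impact column
count_values <- function(d) {
  map_dfr(set_names(values), function(v) {
    summarise(d, across(starts_with("impact"),
                        function(x) sum(x == v, na.rm = TRUE)))
  }, .id = "impact")
}

# Named list of data frames, one element per tumour type
by_tumour <- df2 %>%
  split(.$tumour) %>%
  map(count_values)

# Bind back into one data frame, using the list names as the 'tumour' column
combined <- bind_rows(by_tumour, .id = "tumour")
```

Here bind_rows() is called directly on the list, which is where the .id argument takes its labels from.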

Or use base R

lst1 <- lapply(c("Interrupted", "As.planned"), 
function(x) aggregate(.~ tumour, df, FUN = function(y)
sum(y == x, na.rm = TRUE), na.action = NULL))
data.frame(impact = c("Interrupted", "As.planned"), do.call(rbind, lst1))
#     impact tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
#1 Interrupted Breast                 0                      0                      0                   3                       0
#2  As.planned Breast                 0                      0                      1                   1                       0
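
If you would rather not list the values to count by hand, another possible approach (not from the answer above, just a sketch) is to reshape to long form, drop the NAs, count, and reshape back to wide. The names impact_cols, treatment and counts are my own. I believe names_expand needs tidyr 1.2.0 or later; it is used here so that treatment columns with no observed values still appear, filled with zero.

```r
library(dplyr)
library(tidyr)

# Toy data from the question
df <- tibble::tribble(
  ~tumour, ~impact.on.surgery, ~impact.on.radiotherapy, ~impact.on.chemotherapy, ~impact.on.biologics, ~impact.on.immunotherapy,
  "Breast", NA, NA, NA,           "Interrupted", NA,
  "Breast", NA, NA, NA,           "As.planned",  NA,
  "Breast", NA, NA, NA,           "Interrupted", NA,
  "Breast", NA, NA, "As.planned", NA,            NA,
  "Breast", NA, NA, NA,           NA,            NA,
  "Breast", NA, NA, NA,           "Interrupted", NA
)

impact_cols <- grep("^impact", names(df), value = TRUE)

counts <- df %>%
  # all-NA columns are logical; make every impact column character first
  mutate(across(all_of(impact_cols), as.character)) %>%
  pivot_longer(all_of(impact_cols), names_to = "treatment",
               values_to = "impact", values_drop_na = TRUE) %>%
  # a factor keeps track of treatment columns that have no non-NA values
  mutate(treatment = factor(treatment, levels = impact_cols)) %>%
  count(tumour, impact, treatment) %>%
  pivot_wider(names_from = treatment, values_from = n,
              values_fill = 0L, names_expand = TRUE)
```

The result has one row per tumour and impact value, so split(counts, counts$tumour) would give the named list if that form is needed.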
