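Error when using dplyr::count() inside purrr::map()
I want to count the unique character values in each column of a data frame, by row subset. The full dataset has more than 1000 rows and many tumour types.
Toy example: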
library(tidyverse)
df <- tibble::tribble(
~tumour, ~impact.on.surgery, ~impact.on.radiotherapy, ~impact.on.chemotherapy, ~impact.on.biologics, ~impact.on.immunotherapy,
'Breast', NA, NA, NA, 'Interrupted', NA,
'Breast', NA, NA, NA, 'As.planned', NA,
'Breast', NA, NA, NA, 'Interrupted', NA,
'Breast', NA, NA, 'As.planned', NA, NA,
'Breast', NA, NA, NA, NA, NA,
'Breast', NA, NA, NA, 'Interrupted', NA
)
> df
# A tibble: 6 x 6
tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<chr> <lgl> <lgl> <chr> <chr> <lgl>
1 Breast NA NA NA Interrupted NA
2 Breast NA NA NA As.planned NA
3 Breast NA NA NA Interrupted NA
4 Breast NA NA As.planned NA NA
5 Breast NA NA NA NA NA
6 Breast NA NA NA Interrupted NA
Desired output: ideally a list of data frames named by tumour type, so that I can later call reduce(bind_rows, .id = 'tumour') to add an .id column label.
$Breast
# A tibble: 2 x 6
impact impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Interrupted 0 0 0 3 0
2 As.planned 0 0 1 1 0
Tried so far:
# Gets single row tibble, but not sure how to `.id` label each row, map across all values & bind
df %>%
summarise(across(starts_with('impact'), ~sum(str_count(.x, 'As.planned'), na.rm = T)))
# A tibble: 1 x 5
impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
<int> <int> <int> <int> <int>
1 0 0 1 1 0
# Counts all variable values (no need to specify them) with simpler code, but it also counts `NA`s, and I can't pivot the result to wide form because it has also 'counted' the tumour column
df %>%
map_dfr(~count(data.frame(x=.), x), .id = 'var')
var x n
1 tumour Breast 6
2 impact.on.surgery <NA> 6
3 impact.on.radiotherapy <NA> 6
4 impact.on.chemotherapy As.planned 1
5 impact.on.chemotherapy <NA> 5
6 impact.on.biologics As.planned 1
7 impact.on.biologics Interrupted 3
8 impact.on.biologics <NA> 2
9 impact.on.immunotherapy <NA> 6
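One option with map is to loop over the values to be counted, i.e. 'Interrupted' and 'As.planned'. For each value, group by 'tumour', then use summarise with across on the columns whose names start with 'impact' (selected with starts_with), and take the sum of the logical vector in each column to get the frequency count.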
library(dplyr)
library(purrr)
library(stringr)
map_dfr(dplyr::lst('Interrupted', 'As.planned'), ~
df %>%
group_by(tumour) %>%
summarise(across(starts_with('impact'), function(x)
sum( x == .x, na.rm = TRUE)), .groups = 'drop'), .id = 'impact') %>%
mutate(impact = str_remove_all(impact, '"'))
# A tibble: 2 x 7
# impact tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
# <chr> <chr> <int> <int> <int> <int> <int>
#1 Interrupted Breast 0 0 0 3 0
#2 As.planned Breast 0 0 1 1 0
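As a side note on the counting step, the sketch below contrasts the exact comparison used in this answer with the str_count() call from the question. In str_count() the pattern is a regular expression by default, so the "." in 'As.planned' matches any character. The vector x and its value "As planned" are made up purely for illustration and do not come from the data.

```r
library(stringr)

# Hypothetical values; "As planned" (with a space) is not in the original data
x <- c("Interrupted", "As.planned", NA, "As planned")

# Exact comparison: NA stays NA in the logical vector and is dropped by na.rm
exact_count <- sum(x == "As.planned", na.rm = TRUE)

# Regex matching: the unescaped "." matches any character,
# so "As planned" is counted as well
regex_count <- sum(str_count(x, "As.planned"), na.rm = TRUE)

# fixed() makes str_count() treat the pattern as a literal string
fixed_count <- sum(str_count(x, fixed("As.planned")), na.rm = TRUE)

c(exact = exact_count, regex = regex_count, fixed = fixed_count)
```

With the toy data in the question both approaches give the same counts, so this only matters if the real data contains near-matching values.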
Or, to avoid the quotes around the values, use setNames instead of lst
map_dfr(setNames(c('Interrupted', 'As.planned'),
c('Interrupted', 'As.planned')), ~
df %>%
group_by(tumour) %>%
summarise(across(starts_with('impact'), function(x)
sum( x == .x, na.rm = TRUE)), .groups = 'drop'), .id = 'impact')
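If typing the values twice feels repetitive, purrr::set_names() may be a slightly shorter way to get the same self-named vector. This is a small sketch of the naming behaviour only; the variable names values and named_values are introduced here for illustration.

```r
library(purrr)

values <- c("Interrupted", "As.planned")

# With no second argument, set_names() names the vector after its own values
named_values <- set_names(values)
names(named_values)

# For comparison: lst() auto-names each element from its expression,
# which is why the names in the first approach carry quote characters
names(dplyr::lst("Interrupted", "As.planned"))
```

Passing named_values as the first argument of map_dfr() should then behave like the setNames() version above, with .id = 'impact' picking up the clean names.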
Or using base R
lst1 <- lapply(c("Interrupted", "As.planned"),
function(x) aggregate(.~ tumour, df, FUN = function(y)
sum(y == x, na.rm = TRUE), na.action = NULL))
data.frame(impact = c("Interrupted", "As.planned"), do.call(rbind, lst1))
# impact tumour impact.on.surgery impact.on.radiotherapy impact.on.chemotherapy impact.on.biologics impact.on.immunotherapy
#1 Interrupted Breast 0 0 0 3 0
#2 As.planned Breast 0 0 1 1 0
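A different route, not part of the approaches above, is to reshape to long form, count, and reshape back to wide. It avoids listing the values to count by hand and can return the list named by tumour type that the question asks for. This is only a sketch under the assumption that tidyr is available; the names impact_cols, counts, by_tumour and the intermediate column treatment are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df <- tibble::tribble(
  ~tumour, ~impact.on.surgery, ~impact.on.radiotherapy, ~impact.on.chemotherapy, ~impact.on.biologics, ~impact.on.immunotherapy,
  "Breast", NA, NA, NA, "Interrupted", NA,
  "Breast", NA, NA, NA, "As.planned", NA,
  "Breast", NA, NA, NA, "Interrupted", NA,
  "Breast", NA, NA, "As.planned", NA, NA,
  "Breast", NA, NA, NA, NA, NA,
  "Breast", NA, NA, NA, "Interrupted", NA
)

# Remember the original column order so it can be restored after pivoting
impact_cols <- names(select(df, starts_with("impact")))

counts <- df %>%
  # All-NA columns are logical; convert so every impact column is character
  mutate(across(all_of(impact_cols), as.character)) %>%
  pivot_longer(all_of(impact_cols), names_to = "treatment", values_to = "impact") %>%
  # NAs are kept for now so that every treatment column survives the round trip
  count(tumour, impact, treatment) %>%
  pivot_wider(names_from = treatment, values_from = n, values_fill = 0L) %>%
  # Drop the row that only counted missing values
  filter(!is.na(impact)) %>%
  select(tumour, impact, all_of(impact_cols))

# Named list of data frames, one element per tumour type
by_tumour <- split(select(counts, -tumour), counts$tumour)
by_tumour
```

Calling bind_rows(by_tumour, .id = "tumour") afterwards puts the list back into a single data frame with the tumour label restored.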