r - Using the tidyverse to count the frequency of a specific string across multiple columns in REDCap data



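I often receive data from REDCap surveys in which respondents are allowed to "check" more than one response to a survey question. Each potential response is stored in its own column. I would like to summarize how often each response option (column) was checked. For example: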

library(tidyverse)
set.seed(1234)
responses<-c("Checked", "Unchecked")
numobs<-10
my_example<-data.frame(id=1:10, 
Response_Option_A=sample(responses, numobs, replace=TRUE), 
Response_Option_B=sample(responses, numobs, replace=TRUE), 
Response_Option_C=sample(responses, numobs, replace=TRUE),
Response_Option_D=sample(responses, numobs, replace=TRUE),
stringsAsFactors = FALSE)
my_example
#>    id Response_Option_A Response_Option_B Response_Option_C Response_Option_D
#> 1   1         Unchecked         Unchecked         Unchecked           Checked
#> 2   2         Unchecked         Unchecked         Unchecked         Unchecked
#> 3   3         Unchecked         Unchecked         Unchecked           Checked
#> 4   4         Unchecked           Checked         Unchecked           Checked
#> 5   5           Checked         Unchecked         Unchecked           Checked
#> 6   6         Unchecked         Unchecked         Unchecked         Unchecked
#> 7   7           Checked         Unchecked           Checked           Checked
#> 8   8           Checked           Checked         Unchecked         Unchecked
#> 9   9           Checked         Unchecked         Unchecked         Unchecked
#> 10 10         Unchecked         Unchecked         Unchecked           Checked

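My first inclination was to try this, but it returns the total number of checked responses across all columns rather than the count in each column.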

my_example %>%
select(starts_with("Response_Option_")) %>%
summarise(checked=sum(.=="Checked"))
#>   checked
#> 1      13

Created on 2020-08-10 by the reprex package (v0.3.0)

Thanks for any help summarizing these responses efficiently.

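Here is a tidyverse approach that shows the response totals by column rather than by row. From the wording of your question, I believe this is what you are after. It also uses the starts_with() function, which appears in your question's tags.

We can use pivot_longer() to reshape the response columns from wide to long, then use group_by() to define the grouping variables. This converts the existing table into a grouped table, on which summarise() creates a new data frame with one row for each combination of the grouping variables.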

library(tidyverse)
set.seed(1234)
responses<-c("Checked", "Unchecked")
numobs<-10
my_example<-data.frame(id=1:10, 
Response_Option_A=sample(responses, numobs, replace=TRUE), 
Response_Option_B=sample(responses, numobs, replace=TRUE), 
Response_Option_C=sample(responses, numobs, replace=TRUE),
Response_Option_D=sample(responses, numobs, replace=TRUE),
stringsAsFactors = FALSE)
my_example %>% 
pivot_longer(starts_with("Response_"), names_to = "Responses", 
values_to = "value") %>% 
group_by(Responses, value) %>%
summarise(total_responses = n())

#> # A tibble: 8 x 3
#> # Groups:   Responses [4]
#>   Responses         value     total_responses
#>   <chr>             <chr>               <int>
#> 1 Response_Option_A Checked                 4
#> 2 Response_Option_A Unchecked               6
#> 3 Response_Option_B Checked                 2
#> 4 Response_Option_B Unchecked               8
#> 5 Response_Option_C Checked                 1
#> 6 Response_Option_C Unchecked               9
#> 7 Response_Option_D Checked                 6
#> 8 Response_Option_D Unchecked               4

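Created on 2020-08-10 by the reprex package (v0.3.0)

If you only want the Checked responses, you can pipe the following line of code onto the end of the summarise() operation with %>%: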

filter(value == "Checked")
#> # A tibble: 4 x 3
#> # Groups:   Responses [4]
#>   Responses         value   total_responses
#>   <chr>             <chr>             <int>
#> 1 Response_Option_A Checked               4
#> 2 Response_Option_B Checked               2
#> 3 Response_Option_C Checked               1
#> 4 Response_Option_D Checked               6
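
As a possible shortcut, the filter can be applied before counting, and count() can stand in for the group_by() plus summarise(n()) pair. The sketch below is one way to do that, not part of the original answer. To stay independent of the random number generator, it rebuilds the example data by hand from the printed table above, using a hypothetical helper, chk(), that exists only for this illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not from the original post): marks the given
# row numbers as "Checked" and every other row as "Unchecked".
chk <- function(rows) ifelse(1:10 %in% rows, "Checked", "Unchecked")

# Same values as the my_example table printed in the question.
my_example <- data.frame(
  id = 1:10,
  Response_Option_A = chk(c(5, 7, 8, 9)),
  Response_Option_B = chk(c(4, 8)),
  Response_Option_C = chk(7),
  Response_Option_D = chk(c(1, 3, 4, 5, 7, 10)),
  stringsAsFactors = FALSE
)

checked_counts <- my_example %>%
  pivot_longer(starts_with("Response_Option_"),
               names_to = "Responses", values_to = "value") %>%
  filter(value == "Checked") %>%                 # keep only checked cells
  count(Responses, name = "total_responses")     # one row per response option

checked_counts
```

One caveat with any filter-based version: a response option that nobody checked has no remaining rows, so it disappears from the result instead of showing a count of 0.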

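Check out the tidyREDCap package. It has a set of functions that help with check-all-that-apply variables from REDCap. The package is on CRAN, and its website on github.io lists the vignettes under Articles at the top of the page.

You can use summarise together with across: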

library(dplyr)
my_example %>%
summarise(across(starts_with("Response_Option_"), ~sum(. == 'Checked')))
#  Response_Option_A Response_Option_B Response_Option_C Response_Option_D
#1                 4                 2                 1                 6
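
The difference from the attempt in the question may be easier to see side by side. In the original call, `.` stands for the whole data frame, so the comparison yields a logical matrix and sum() adds up every cell. With across(), the function is applied to one column at a time. The sketch below rebuilds the example data by hand from the printed table, using a hypothetical helper chk() introduced only for illustration.

```r
library(dplyr)

# Hypothetical helper (not from the original post): marks the given
# row numbers as "Checked" and every other row as "Unchecked".
chk <- function(rows) ifelse(1:10 %in% rows, "Checked", "Unchecked")

# Same values as the my_example table printed in the question.
my_example <- data.frame(
  id = 1:10,
  Response_Option_A = chk(c(5, 7, 8, 9)),
  Response_Option_B = chk(c(4, 8)),
  Response_Option_C = chk(7),
  Response_Option_D = chk(c(1, 3, 4, 5, 7, 10)),
  stringsAsFactors = FALSE
)

resp <- my_example %>% select(starts_with("Response_Option_"))

# Comparing the whole data frame gives a 10 x 4 logical matrix;
# sum() then counts every TRUE cell, regardless of column.
total_checked <- sum(resp == "Checked")

# across() hands each column to the function separately,
# so the result has one count per column.
per_column <- resp %>%
  summarise(across(everything(), ~ sum(.x == "Checked")))

total_checked
per_column
```

Here total_checked is the single grand total from the question (13), while per_column holds the four per-option counts that add up to it.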

In older versions of dplyr, you can use summarise_at:

my_example %>%
summarise_at(vars(starts_with("Response_Option_")), ~sum(. == 'Checked'))

A very base R solution would be:

my_example$checked <- apply(my_example[,which(grepl('Response_Option_',names(my_example)))],1,
function(x) length(which(x=="Checked")))

Output:

id Response_Option_A Response_Option_B Response_Option_C Response_Option_D checked
1   1         Unchecked         Unchecked         Unchecked           Checked       1
2   2         Unchecked         Unchecked         Unchecked         Unchecked       0
3   3         Unchecked         Unchecked         Unchecked           Checked       1
4   4         Unchecked           Checked         Unchecked           Checked       2
5   5           Checked         Unchecked         Unchecked           Checked       2
6   6         Unchecked         Unchecked         Unchecked         Unchecked       0
7   7           Checked         Unchecked           Checked           Checked       3
8   8           Checked           Checked         Unchecked         Unchecked       2
9   9           Checked         Unchecked         Unchecked         Unchecked       1
10 10         Unchecked         Unchecked         Unchecked           Checked       1

Also, the best way (credit to @r2evans):

my_example$checked <- rowSums(my_example[, grep("^Response_", colnames(my_example))] == "Checked")

It produces the same output as above and is more readable.
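
Note that the rowSums() call counts checked options per respondent (row). If the goal is the per-column frequency asked about in the question, colSums() is the column-wise counterpart. The following is a minimal base R sketch, with the example data rebuilt by hand from the printed table via a hypothetical helper chk() that is not part of the original post.

```r
# Hypothetical helper (not from the original post): marks the given
# row numbers as "Checked" and every other row as "Unchecked".
chk <- function(rows) ifelse(1:10 %in% rows, "Checked", "Unchecked")

# Same values as the my_example table printed in the question.
my_example <- data.frame(
  id = 1:10,
  Response_Option_A = chk(c(5, 7, 8, 9)),
  Response_Option_B = chk(c(4, 8)),
  Response_Option_C = chk(7),
  Response_Option_D = chk(c(1, 3, 4, 5, 7, 10)),
  stringsAsFactors = FALSE
)

# Positions of the response columns, matched by name prefix.
response_cols <- grep("^Response_Option_", names(my_example))

# Logical matrix: TRUE wherever a cell equals "Checked".
is_checked <- my_example[, response_cols] == "Checked"

checked_per_row    <- rowSums(is_checked)  # one count per respondent
checked_per_column <- colSums(is_checked)  # one count per response option

checked_per_column
```

The result is a named numeric vector with one element per response column, which can be turned into a data frame with stack() or data.frame() if a table is preferred.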
