What is the best way to analyze checkbox data (i.e., column counts) in R, where each option is its own column and unselected options are NA?



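Survey results from Qualtrics encode questions that allow multiple responses, such as race/ethnicity demographics (shown below), in a way for which I cannot come up with a simple analysis solution. Each checked box is recorded per row in that option's own column, and unchecked options are left empty. I decided a good start would be to count the non-NA values for each option. However, it did not go as planned, and a thorough search for existing solutions was not much help. I found a way to get column counts using apply, but working with the output is still a bit clunky. I have a data frame with many columns that need to be analyzed this way, so I use the grep function to select the relevant columns whose selections need counting.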

Data:

structure(list(race_White = c("White", NA, NA, "White", NA, NA, 
"White", "White", NA, "White", "White", "White", "White", "White", 
"White", "White", "White", "White", "White", "White", NA, "White", 
"White", "White", NA), `race_Black or African American` = c(NA, 
NA, "Black or African American", NA, NA, "Black or African American", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, "Black or African American"), `race_American Indian or Alaska Native` = c(NA, 
NA, NA, NA, NA, NA, NA, NA, "American Indian or Alaska Native", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
), race_Asian = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, "Asian", NA, NA, NA, NA), 
`race_Middle Eastern or North African` = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_
), `race_Hispanic, Latino or Spanish` = c(NA, "Hispanic, Latino or Spanish", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), `race_Native Hawaiian or Pacific Islander` = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_
), `race_ Prefer not to share` = c(NA, NA, NA, NA, "Prefer not to share", 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA), race_Other = c(NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, 
NA_character_, NA_character_, NA_character_), education_level = structure(c(3L, 
2L, 5L, 4L, 6L, 3L, 6L, 2L, 3L, 3L, 5L, 2L, 5L, 5L, 3L, 3L, 
5L, 2L, 5L, 5L, 5L, 3L, 3L, 3L, 5L), .Label = c("Less than high school degree", 
"High school graduate (high school diploma or equivalent)", 
"Some college but no degree", "Associate's degree (2-year)", 
"Bachelor's degree (4-year)", "Master's degree", "Doctoral/Professional degree (PhD, MD, JD)", 
"Other/Prefer not to share"), class = "factor"), age = c(74, 
43, NA, 37, 61, 64, NA, NA, 45, NA, NA, 21, NA, NA, 52, 43, 
43, NA, 65, 42, NA, 27, 35, NA, 46)), row.names = c(NA, -25L
), class = c("tbl_df", "tbl", "data.frame"))

I used grep to select the column numbers whose selections I want to count:

race<-c(grep("race", colnames(data)))

I also stored the column names, in case a function needs names rather than numbers:

racenames<-colnames(data[race])

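After creating these selections, I tried to get a table counting the non-empty rows in each column using the following (which did not work):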

racecounts <- sapply(data[race],FUN = function(x){length(x[x!=""])})
racecounts

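This basically counted every row in each column rather than only the non-empty rows I was hoping for. So I tried a simple apply call instead, which did work: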

racecounts2 <- apply(data[race], 2, table)
racecounts2

That works, but I then have to convert it with prop.table to get proportions that I can use with kable:

racecounts2<-prop.table(racecounts2)
racecounts2%>%
kbl() %>%
kable_material_dark()

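I am curious whether anyone has found an alternative or better way to handle this data format. I am open to trying anything different; this approach feels clunky and its output leaves something to be desired. It would be great to find a way of handling these data that makes ranking, plotting, and so on easier.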

So I am just curious what the community would do.

You can use !is.na to count the number of non-NA values in the race columns, like so:

colSums(!is.na(data[race]))
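For context on why the sapply attempt in the question counted every row: comparing NA with "" yields NA rather than FALSE, and subsetting with an NA index keeps an NA element, so the length never shrinks. A minimal sketch, using a made-up vector x (hypothetical, not taken from the survey data):

```r
# Hypothetical toy column: two selections and three unselected (NA) cells
x <- c("White", NA, NA, "White", NA)

# Comparisons with NA give NA, not FALSE
cmp <- x != ""

# Indexing with NA keeps an NA element, so no rows are dropped
n_attempt <- length(x[cmp])

# is.na() never returns NA, so this counts only the real selections
n_selected <- sum(!is.na(x))

print(cmp)
print(n_attempt)
print(n_selected)
```

The same reasoning is why colSums(!is.na(...)) gives the intended per-column counts.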

Alternatively, use dplyr syntax and tidyr::pivot_longer to make the result look more like a table:

library(dplyr)
library(tidyr)
library(stringr)
# \(x) is the R >= 4.1 shorthand for function(x)
data %>% select(starts_with("race")) %>% 
  summarise(across(everything(), ~sum(!is.na(.x)))) %>% 
  pivot_longer(cols = everything(), names_to = "race", values_to = "count",
               names_transform = list(race = \(x) str_remove(x, "race_")))
# A tibble: 9 x 2
race                                  count
<chr>                                 <int>
1 "White"                                  18
2 "Black or African American"               3
3 "American Indian or Alaska Native"        1
4 "Asian"                                   1
5 "Middle Eastern or North African"         0
6 "Hispanic, Latino or Spanish"             1
7 "Native Hawaiian or Pacific Islander"     0
8 " Prefer not to share"                    1
9 "Other"                                   0
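Since the question also asks about proportions and ranking, here is one possible follow-up in base R. It is only a sketch: the data frame toy and the names race_cols, counts, result, and share are hypothetical stand-ins for illustration, not part of the original answer.

```r
# Hypothetical miniature of the survey: 4 respondents, 3 checkbox columns
toy <- data.frame(
  race_White = c("White", NA, "White", "White"),
  race_Asian = c(NA, "Asian", NA, "Asian"),
  race_Other = c(NA_character_, NA, NA, NA),
  age        = c(30, 41, 52, 63)
)

# Pick the checkbox columns by their shared prefix
race_cols <- grep("^race_", names(toy), value = TRUE)

# Count selections per option, then express them as a share of respondents
counts <- colSums(!is.na(toy[race_cols]))
result <- data.frame(
  race  = sub("^race_", "", names(counts)),
  count = unname(counts),
  share = unname(counts) / nrow(toy)
)

# Sort so the most frequently selected option comes first
result <- result[order(-result$count), ]
print(result)
```

Because a respondent can tick several boxes, this sketch divides by the number of respondents rather than by the total number of ticks, so the shares need not sum to 1. That choice of denominator is an assumption about what is wanted. The sorted result could then be passed to barplot(result$count, names.arg = result$race) or a similar plotting call.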
