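Apologies if this has been asked before; I couldn't find it. I have a dataset in which every survey a participant completes sits on its own row. There are about 10 rows per participant per time point. I need one row per participant per time point. Here is some test data: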
x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))
    time PartNum    event S1Q1 S1Q2     S1date S2Q1 S2Q2     S2date race age
1 week_1       1     <NA>   NA   NA       <NA>   NA   NA       <NA>    0  60
2 week_1       1 Survey_1    3    4 2020-03-02   NA   NA       <NA>   NA  NA
3 week_1       1 Survey 2   NA   NA       <NA>    5    3 2020-03-02   NA  NA
4 week_1       2     <NA>   NA   NA       <NA>   NA   NA       <NA>    1  58
5 week_1       2 Survey_1    1    2 2020-03-04   NA   NA       <NA>   NA  NA
6 week_1       2 Survey 2   NA   NA       <NA>    3    2 2020-03-04   NA  NA
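How can I get Survey 1, Survey 2, and the demographic data onto a single row for each participant and time point? (Note: to save space, the test data shows only one time point.)
Desired result: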
desired_x <- data.frame(time = rep("week_1",2), PartNum = c(1,2), S1Q1 = c(3,1),
S1Q2 = c(4,2), S1date = c("2020-03-02","2020-03-04"),
S2Q1 = c(5,3), S2Q2 = c(3,2),
S2date = c("2020-03-02","2020-03-04"),
race = c(0,1), age = c(60,58))
    time PartNum S1Q1 S1Q2     S1date S2Q1 S2Q2     S2date race age
1 week_1       1    3    4 2020-03-02    5    3 2020-03-02    0  60
2 week_1       2    1    2 2020-03-04    3    2 2020-03-04    1  58
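I have read many answers on this site, but this is my first question. Thank you for your patience and help this time, and for all the help you have unknowingly given me in the past.
Edited: a simpler version that does not rely on a custom function
Use na.omit to keep only the valid (non-NA) observations for each time/PartNum group. This yields one row per group as long as each column has exactly one non-NA value in that group, as in the test data: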
x %>% select(-event) %>%
group_by(time, PartNum) %>%
summarise_all(na.omit)
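If a group might hold zero or several non-NA values in some column, the na.omit version will not collapse cleanly to one row. A minimal sketch of a more defensive variant follows; `first_non_na` and `x_one_row` are hypothetical names introduced here for illustration, and keeping the first non-NA value is an assumption about what you want in that case, not something stated in the question.

```r
library(dplyr)

x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))

# Hypothetical helper: first non-missing value of a vector.
# Indexing an empty vector with [1] gives a typed NA, so a group
# with no valid value still produces exactly one (missing) entry.
first_non_na <- function(v) {
  v[!is.na(v)][1]
}

x_one_row <- x %>%
  select(-event) %>%
  group_by(time, PartNum) %>%
  summarise(across(everything(), first_non_na), .groups = "drop")

x_one_row
```

The trade-off is that extra non-NA values in a group are silently dropped, so this is only appropriate when duplicates are known to be redundant.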
Previous version:
The following solves your problem using dplyr:
x_clean <- x %>% # (0)
select(-event) %>% # (1)
group_by(time, PartNum) %>% # (2)
mutate(across(.cols = everything(), # (3)
.fns = getmode)) %>%
distinct() # (4)
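Each step can be read as follows:
- (0) take the dataset x, THEN (read %>% as "then")
- (1) remove the variable event from the dataset, then
- (2) group by time and PartNum, then
- (3) mutate across all (non-grouping) variables and get the mode of each within every group (each combination of time and PartNum). This replaces the NA values with the most common observation in each group. If you stopped here, you would get duplicated rows in each group, so finally
- (4) keep only the distinct rows of the resulting dataset.
- The resulting dataset is assigned to x_clean.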
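To see why the final distinct() step matters, the pipeline can be stopped after the mutate and the row counts compared. This is a small sketch using the question's test data and the getmode helper from this answer; `filled` is a name introduced here for the intermediate result.

```r
library(dplyr)

x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))

# Helper from this answer: most common non-NA value of a vector
getmode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- na.exclude(v)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Stop after the mutate: mutate keeps every input row and
# fills each column with its per-group mode
filled <- x %>%
  select(-event) %>%
  group_by(time, PartNum) %>%
  mutate(across(.cols = everything(), .fns = getmode))

nrow(filled)            # still 6 rows, three identical rows per participant
nrow(distinct(filled))  # 2 rows, one per time/PartNum group
```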
Full code to reproduce this:
## your data.frame
x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))
# helper function that works for numeric and character data
# will retrieve the most common value.
getmode <- function(v, na.rm = TRUE) {
if (na.rm) v <- na.exclude(v)
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
## solution
library(tidyverse)
x_clean <- x %>% # (0)
select(-event) %>% # (1)
group_by(time, PartNum) %>% # (2)
mutate(across(.cols = everything(), # (3)
.fns = getmode)) %>%
distinct() # (4)
x_clean
#> # A tibble: 2 x 10
#> # Groups: time, PartNum [2]
#> time PartNum S1Q1 S1Q2 S1date S2Q1 S2Q2 S2date race age
#> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 week_1 1 3 4 2020-03-02 5 3 2020-03-02 0 60
#> 2 week_1 2 1 2 2020-03-04 3 2 2020-03-04 1 58
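I think the best way to achieve this is to first write a custom function that returns only the non-NA values, and then use dplyr functions to summarise by time and PartNum. Here is an example using your data: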
##Loading dplyr package##
library(dplyr)
##Example Data##
x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2), event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"), S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA), S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA), S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2), S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"), race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))
##Function to return only non-NA values##
fxn<-function(vec){
out<-vec[!is.na(vec)]
return(out)
}
##Summarizing the data using the new function##
#We'll want to get rid of the event column, hence the x[,-3]##
DF<-as.data.frame(x[,-3] %>% group_by(time, PartNum) %>% summarise_all(fxn))
##See the results##
DF
##Compare to your desired output##
y <- data.frame(time = rep("week_1",2), PartNum = c(1,2), S1Q1 = c(3,1), S1Q2 = c(4,2), S1date = c("2020-03-02","2020-03-04"), S2Q1 = c(5,3), S2Q2 = c(3,2), S2date = c("2020-03-02","2020-03-04"), race = c(0,1), age = c(60,58))
y
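Rather than comparing the two printouts by eye, the result can be checked against the desired output programmatically. A minimal sketch, assuming the same data and function as above; `same` is a name introduced here, and check.attributes = FALSE is used so that only the column values are compared.

```r
library(dplyr)

x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))

# Same function as above: keep only the non-NA values
fxn <- function(vec) {
  vec[!is.na(vec)]
}

DF <- as.data.frame(x[, -3] %>% group_by(time, PartNum) %>% summarise_all(fxn))

y <- data.frame(time = rep("week_1",2), PartNum = c(1,2), S1Q1 = c(3,1),
                S1Q2 = c(4,2), S1date = c("2020-03-02","2020-03-04"),
                S2Q1 = c(5,3), S2Q2 = c(3,2),
                S2date = c("2020-03-02","2020-03-04"),
                race = c(0,1), age = c(60,58))

# all.equal() returns TRUE or a description of the differences,
# so wrap it in isTRUE() to get a plain logical
same <- isTRUE(all.equal(DF, y, check.attributes = FALSE))
same
```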
Good luck! Take care, -Sean