How to collapse staggered data into one row per participant number and time point in R



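I apologize if this has been asked before; I couldn't find it. I have a dataset in which every survey completed by each participant sits on its own row, so there are about 10 rows per participant per time point. I need one row per participant per time point. Here is some test data: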

x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))
    time PartNum    event S1Q1 S1Q2     S1date S2Q1 S2Q2     S2date race age
1 week_1       1     <NA>   NA   NA       <NA>   NA   NA       <NA>    0  60
2 week_1       1 Survey_1    3    4 2020-03-02   NA   NA       <NA>   NA  NA
3 week_1       1 Survey 2   NA   NA       <NA>    5    3 2020-03-02   NA  NA
4 week_1       2     <NA>   NA   NA       <NA>   NA   NA       <NA>    1  58
5 week_1       2 Survey_1    1    2 2020-03-04   NA   NA       <NA>   NA  NA
6 week_1       2 Survey 2   NA   NA       <NA>    3    2 2020-03-04   NA  NA

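How can I get Survey 1, Survey 2, and the demographic data onto a single row for each participant and time point? (Note: to save space, the test data shows only one time point.)

Desired result: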

desired_x <- data.frame(time = rep("week_1",2), PartNum = c(1,2), S1Q1 = c(3,1),
                        S1Q2 = c(4,2), S1date = c("2020-03-02","2020-03-04"),
                        S2Q1 = c(5,3), S2Q2 = c(3,2),
                        S2date = c("2020-03-02","2020-03-04"),
                        race = c(0,1), age = c(60,58))
    time PartNum S1Q1 S1Q2     S1date S2Q1 S2Q2     S2date race age
1 week_1       1    3    4 2020-03-02    5    3 2020-03-02    0  60
2 week_1       2    1    2 2020-03-04    3    2 2020-03-04    1  58

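I have read many answers on this site, but this is my first question. Thank you for your patience and help this time, and for all the help you have unknowingly given me in the past.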

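Edited: a simpler version that does not rely on a custom function

Use na.omit to keep only the valid (non-NA) observations for each time/PartNum group. This relies on every column having exactly one non-NA value per group, as in the test data.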

library(dplyr)

x %>% select(-event) %>% 
  group_by(time, PartNum) %>% 
  summarise_all(na.omit)
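Since the real data has several time points, here is a minimal sketch of the same idea on a toy data frame with two weeks. The `toy` data and its `week_2` values are made up for illustration, and the sketch swaps `summarise_all(na.omit)` for `summarise(across(...))`, assuming dplyr 1.0.0 or later. It keeps the first non-NA value per column in each group, so a column that is entirely NA in a group simply stays NA.

```r
library(dplyr)

# Hypothetical toy data: two participants, two time points, trimmed columns.
toy <- data.frame(
  time    = rep(c("week_1", "week_2"), each = 6),
  PartNum = rep(c(1, 1, 1, 2, 2, 2), times = 2),
  event   = rep(c(NA, "Survey_1", "Survey 2"), times = 4),
  S1Q1    = c(NA, 3, NA, NA, 1, NA, NA, 4, NA, NA, 2, NA),
  S2Q1    = c(NA, NA, 5, NA, NA, 3, NA, NA, 1, NA, NA, 4),
  age     = c(60, NA, NA, 58, NA, NA, 60, NA, NA, 58, NA, NA)
)

collapsed <- toy %>%
  select(-event) %>%                      # drop the survey label column
  group_by(time, PartNum) %>%             # one group per participant and week
  summarise(
    # first non-NA value of each column; NA if the group has none
    across(everything(), ~ .x[!is.na(.x)][1]),
    .groups = "drop"
  )

collapsed
```

The result should have one row per time/PartNum combination (four rows here). Taking the first non-NA value silently hides conflicts if a column ever has two different answers in the same group, so it is worth checking for that separately.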

Previous version:

The following solves your problem with dplyr:

x_clean <- x %>%                         # (0)
  select(-event) %>%                     # (1)
  group_by(time, PartNum) %>%            # (2)
  mutate(across(.cols = everything(),    # (3)
                .fns = getmode)) %>%
  distinct()                             # (4)

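Each step can be read as follows (read %>% as "then"):

  0. take the dataset x, THEN
  1. remove the variable event from the dataset, then
  2. group by time and PartNum, then
  3. mutate across all columns within each group and take the mode (the most common value) for each time/PartNum combination. This replaces every NA with the most common observed value in its group. If you stopped here, each group would contain duplicated rows, so finally
  4. keep only the distinct rows of the resulting dataset.
  5. The resulting dataset is assigned to x_clean.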
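To make step 3 concrete, here is a small sketch of what the getmode helper returns on plain vectors. The function body is the one from this answer's full code; the input vectors are made up for illustration.

```r
# Helper from this answer: most common non-NA value of a vector.
getmode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- na.exclude(v)               # drop NAs before counting
  uniqv <- unique(v)                          # candidate values
  uniqv[which.max(tabulate(match(v, uniqv)))] # value with the highest count
}

num_mode <- getmode(c(NA, 3, NA))          # one valid value among NAs
chr_mode <- getmode(c("a", "b", "b", NA))  # works for character data too
tie_mode <- getmode(c(2, 1, 1, 2))         # tie: the value seen first wins

num_mode  # 3
chr_mode  # "b"
tie_mode  # 2
```

The tie case matters for real data: if a column has two different non-NA values in the same group, getmode picks one of them without any warning.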

Complete code to reproduce

## your data.frame
x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2),
                event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"),
                S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA),
                S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA),
                S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2),
                S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"),
                race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))

# helper function that works for numeric and character data
# will retrieve the most common value. 
getmode <- function(v, na.rm = TRUE) {
  if (na.rm) v <- na.exclude(v)
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

## solution 
library(tidyverse)
x_clean <- x %>%                         # (0)
  select(-event) %>%                     # (1)
  group_by(time, PartNum) %>%            # (2)
  mutate(across(.cols = everything(),    # (3)
                .fns = getmode)) %>%
  distinct()                             # (4)
x_clean
#> # A tibble: 2 x 10
#> # Groups:   time, PartNum [2]
#>   time   PartNum  S1Q1  S1Q2 S1date      S2Q1  S2Q2 S2date      race   age
#>   <chr>    <dbl> <dbl> <dbl> <chr>      <dbl> <dbl> <chr>      <dbl> <dbl>
#> 1 week_1       1     3     4 2020-03-02     5     3 2020-03-02     0    60
#> 2 week_1       2     1     2 2020-03-04     3     2 2020-03-04     1    58

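I think the best way to do this is to first write a custom function that returns only the non-NA values, and then use dplyr functions to summarise by time and PartNum. Here is an example using your data: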

##Loading dplyr package##
library(dplyr)
##Example Data## 
x <- data.frame(time = rep("week_1",6), PartNum = c(1,1,1,2,2,2), event = c(NA, "Survey_1", "Survey 2", NA, "Survey_1", "Survey 2"), S1Q1 = c(NA,3,NA,NA,1,NA), S1Q2 = c(NA,4,NA,NA,2,NA), S1date = c(NA,"2020-03-02",NA,NA,"2020-03-04",NA), S2Q1 = c(NA,NA,5,NA,NA,3), S2Q2 = c(NA,NA,3,NA,NA,2), S2date = c(NA,NA,"2020-03-02",NA,NA,"2020-03-04"), race = c(0,NA,NA,1,NA,NA), age = c(60,NA,NA,58,NA,NA))
##Function to return only non-NA values##
fxn<-function(vec){
out<-vec[!is.na(vec)]
return(out)
}
##Summarizing the data using the new function##
#We'll want to get rid of the event column, hence the x[,-3]##
DF<-as.data.frame(x[,-3] %>% group_by(time, PartNum) %>% summarise_all(fxn))
##See the results##
DF
##Compare to your desired output##
y <- data.frame(time = rep("week_1",2), PartNum = c(1,2), S1Q1 = c(3,1), S1Q2 = c(4,2), S1date = c("2020-03-02","2020-03-04"), S2Q1 = c(5,3), S2Q2 = c(3,2), S2date = c("2020-03-02","2020-03-04"), race = c(0,1), age = c(60,58))
y
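One caveat: a function that returns all non-NA values only gives one row per group when each column has at most one distinct answer per group. As a rough sketch (the `toy` data below is made up and deliberately contains a conflict, and `if_any()` assumes dplyr 1.0.4 or later), you could count the distinct non-NA values per group before collapsing:

```r
library(dplyr)

# Hypothetical toy data: participant 2 has two different S1Q1 answers.
toy <- data.frame(
  time    = rep("week_1", 6),
  PartNum = c(1, 1, 1, 2, 2, 2),
  S1Q1    = c(NA, 3, NA, NA, 1, 2),
  age     = c(60, NA, NA, 58, NA, NA)
)

conflicts <- toy %>%
  group_by(time, PartNum) %>%
  # number of distinct non-NA values per column within each group
  summarise(across(everything(), ~ n_distinct(.x, na.rm = TRUE)),
            .groups = "drop") %>%
  # keep only groups where some column has more than one distinct value
  filter(if_any(-c(time, PartNum), ~ .x > 1))

conflicts
```

An empty result means every group collapses cleanly to one row. Any rows that do show up point at the participant and time point to inspect by hand.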

Good luck! Take care, -Sean
