R: copying information when multiple data points exist



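I have a data-cleaning problem. Data were collected three times, and sometimes values were entered incorrectly. So if a student's data were collected more than once, the value from the second data point needs to be copied to that student's other rows.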

Here is my dataset:

df <- data.frame(id = c(1,1,1, 2,2,2, 3,3, 4,4, 5),
                 text = c("female","male","male", "female","female","female", "male","female","male", "female", "female"),
                 time = c("first","second","third", "first","second","third", "first","second","second", "third", "first"))

> df
id   text   time
1   1 female  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3   male  first
8   3 female second
9   4   male second
10  4 female  third
11  5 female  first

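So ids 1, 3 and 4 have inconsistent gender information. When there are multiple, differing entries for the gender variable, I need to copy the value from the second data point. If an id has only one data point, that row should stay in the dataset as it is.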

The desired output is:

> df1
id   text   time
1   1   male  first
2   1   male second
3   1   male  third
4   2 female  first
5   2 female second
6   2 female  third
7   3 female  first
8   3 female second
9   4   male second
10  4   male  third
11  5 female  first

Any ideas? Thanks.

Here is just another approach, for fun:

library(dplyr)
df %>% 
  filter(time == "second") %>% 
  select(-time) %>% 
  full_join(df, ., by = "id", suffix = c("_old", "")) %>% 
  mutate(text = coalesce(text, text_old)) %>% 
  select(names(df))
#>       id text   time  
#>  1     1 male   first 
#>  2     1 male   second
#>  3     1 male   third 
#>  4     2 female first 
#>  5     2 female second
#>  6     2 female third 
#>  7     3 female first 
#>  8     3 female second
#>  9     4 male   second
#> 10     4 male   third 
#> 11     5 female first

We can use match:

library(dplyr)
df %>% 
  group_by(id) %>%
  mutate(text = text[match("second", time, nomatch = 1)]) %>%
  ungroup()

-output

# A tibble: 11 × 3
id text   time  
<dbl> <chr>  <chr> 
1     1 male   first 
2     1 male   second
3     1 male   third 
4     2 female first 
5     2 female second
6     2 female third 
7     3 female first 
8     3 female second
9     4 male   second
10     4 male   third 
11     5 female first 

Or use coalesce:

df %>% 
  group_by(id) %>%
  mutate(text = coalesce(text[match("second", time)], text)) %>%
  ungroup()

-output

# A tibble: 11 × 3
id text   time  
<dbl> <chr>  <chr> 
1     1 male   first 
2     1 male   second
3     1 male   third 
4     2 female first 
5     2 female second
6     2 female third 
7     3 female first 
8     3 female second
9     4 male   second
10     4 male   third 
11     5 female first 
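The two variants above give the same result on the example data, but they may differ for an id that has no "second" row and more than one distinct value. The following base R sketch looks at a single hypothetical group (id 6 is not in the original data) to show what each expression picks before mutate recycles it across the group. The coalesce step is imitated here with a plain if/else, which is an assumption made to keep the sketch free of package dependencies.

```r
# Hypothetical group with no "second" row (not part of the original df)
text <- c("male", "female")
time <- c("first", "third")

# match variant: nomatch = 1 falls back to position 1,
# so the first value of the group is picked
picked_match <- text[match("second", time, nomatch = 1)]

# coalesce variant: match() returns NA, and indexing with NA gives NA
picked_na <- text[match("second", time)]

# Stand-in for coalesce(picked_na, text): keep each row's own value
# when no "second" value was found
fallback <- if (is.na(picked_na)) text else rep(picked_na, length(text))

print(picked_match)  # "male"
print(picked_na)     # NA
print(fallback)      # "male" "female"
```

So under these assumptions the match variant would overwrite the whole group with its first value, while the coalesce variant would leave such a group unchanged.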

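With {dplyr} we could use the following approach:

  1. we group_by(id)
  2. inside ifelse we check whether text has any elements where time == "second"; for this we use length
  3. if that is the case, we use text[time == "second"], otherwise we use text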

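I just wonder what happens if you have three data entries where first and second are the same and third is different. Then the approach above would not work.

Also, what should happen if first is "male", second is "female" and third is "male" again? Which one should be chosen?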
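Before deciding on a rule for such cases, it may help to list the ids that actually carry conflicting values. This is a minimal base R sketch using the data from the question; the names n_distinct_text and conflict_ids are made up for illustration.

```r
df <- data.frame(id = c(1,1,1, 2,2,2, 3,3, 4,4, 5),
                 text = c("female","male","male", "female","female","female",
                          "male","female","male", "female", "female"),
                 time = c("first","second","third", "first","second","third",
                          "first","second","second", "third", "first"))

# Number of distinct text values per id
n_distinct_text <- vapply(split(df$text, df$id),
                          function(x) length(unique(x)),
                          integer(1))

# ids with more than one distinct value (returned as character names)
conflict_ids <- names(n_distinct_text)[n_distinct_text > 1]
print(conflict_ids)  # "1" "3" "4"

# Rows belonging to those ids, for manual inspection
print(df[df$id %in% conflict_ids, ])
```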

The approach below uses only second (if available) and ignores the rest.

library(dplyr)

df %>% 
  group_by(id) %>% 
  mutate(text = ifelse(length(text[time == "second"]) > 0,
                       text[time == "second"],
                       text))
#> # A tibble: 11 × 3
#> # Groups:   id [5]
#>       id text   time  
#>    <dbl> <chr>  <chr> 
#>  1     1 male   first 
#>  2     1 male   second
#>  3     1 male   third 
#>  4     2 female first 
#>  5     2 female second
#>  6     2 female third 
#>  7     3 female first 
#>  8     3 female second
#>  9     4 male   second
#> 10     4 male   third 
#> 11     5 female first

Created on 2022-09-15 by the reprex package (v0.3.0)
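One detail worth checking in the ifelse call above: the test is a single TRUE/FALSE, and ifelse returns a result as long as its test. For an id with no "second" row, the "otherwise we use text" branch would therefore yield only the first element of text, which mutate then recycles over the group. The sketch below shows this on a hypothetical group and offers an if/else alternative; fill_from_second is an illustrative helper name, not something from the answer.

```r
# Hypothetical group with no "second" row and two different values
text <- c("male", "female")
time <- c("first", "third")

# Scalar test: ifelse() returns length 1, i.e. only the first element of text
scalar_result <- ifelse(length(text[time == "second"]) > 0,
                        text[time == "second"],
                        text)

# if/else returns the chosen branch whole
fill_from_second <- function(text, time) {
  has_second <- time == "second"
  if (any(has_second)) {
    # use the first "second" value for every row of the group
    rep(text[has_second][1], length(text))
  } else {
    # no "second" row: leave the group's values untouched
    text
  }
}

kept <- fill_from_second(text, time)
print(scalar_result)  # "male"
print(kept)           # "male" "female"
```

Inside the grouped pipeline this could be used as mutate(text = fill_from_second(text, time)). On the example data from the question both versions give the same output, since every id without a "second" row has a single row.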
