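I have a data cleaning problem. Data was collected at three time points, and some entries were recorded incorrectly. When a student has more than one record, the value from the second data point should be copied to that student's other records.
Here is my dataset: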
df <- data.frame(id = c(1,1,1, 2,2,2, 3,3, 4,4, 5),
text = c("female","male","male", "female","female","female", "male","female","male", "female", "female"),
time = c("first","second","third", "first","second","third", "first","second","second", "third", "first"))
> df
id text time
1 1 female first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 male first
8 3 female second
9 4 male second
10 4 female third
11 5 female first
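So ids 1, 3, and 4 have inconsistent gender information. When there are multiple or conflicting entries for the gender variable (the text column above), I need to copy the value from the second data point. If a student has only one data point, it should stay in the dataset unchanged.
The desired output is: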
> df1
id text time
1 1 male first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 female first
8 3 female second
9 4 male second
10 4 male third
11 5 female first
Any ideas? Thanks.
Here is just another fun approach:
library(dplyr)
df %>%
  filter(time == "second") %>%
  select(-time) %>%
  full_join(df, ., by = "id", suffix = c("_old", "")) %>%
  mutate(text = coalesce(text, text_old)) %>%
  select(names(df))
#> id text time
#> 1 1 male first
#> 2 1 male second
#> 3 1 male third
#> 4 2 female first
#> 5 2 female second
#> 6 2 female third
#> 7 3 female first
#> 8 3 female second
#> 9 4 male second
#> 10 4 male third
#> 11 5 female first
We can use match:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(text = text[match("second", time, nomatch = 1)]) %>%
  ungroup()
-output
# A tibble: 11 × 3
id text time
<dbl> <chr> <chr>
1 1 male first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 female first
8 3 female second
9 4 male second
10 4 male third
11 5 female first
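To see why nomatch = 1 makes a student without a "second" record fall back to their first row, it may help to look at match on its own. This is only a small base R sketch; the names pos_full, pos_na, pos_first, and picked are made up for illustration.

```r
# Position of "second" in a group's time vector
pos_full  <- match("second", c("first", "second", "third"))  # 2
# Without a "second" entry the default result is NA ...
pos_na    <- match("second", c("first"))                     # NA
# ... while nomatch = 1 falls back to position 1
pos_first <- match("second", c("first"), nomatch = 1)        # 1

# Indexing text by that position yields the single value that mutate() recycles
text <- c("female", "male", "male")
time <- c("first", "second", "third")
picked <- text[match("second", time, nomatch = 1)]           # "male"
```

So within each id, every row receives the text found at the "second" position, or the text of the first row when there is no "second" record.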
Or use coalesce:
df %>%
  group_by(id) %>%
  mutate(text = coalesce(text[match("second", time)], text)) %>%
  ungroup()
-output
# A tibble: 11 × 3
id text time
<dbl> <chr> <chr>
1 1 male first
2 1 male second
3 1 male third
4 2 female first
5 2 female second
6 2 female third
7 3 female first
8 3 female second
9 4 male second
10 4 male third
11 5 female first
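The two variants give the same result on the example data, but they seem to differ for a student who has several conflicting records and no "second" one. The sketch below uses a hypothetical id 6 that is not in the question's data, and assumes a recent dplyr (1.1.0 or later), where coalesce() recycles a length-1 first argument.

```r
library(dplyr)

# Hypothetical data: id 6 has no "second" row and two conflicting entries
edge <- data.frame(id   = c(6, 6),
                   text = c("male", "female"),
                   time = c("first", "third"))

# match(..., nomatch = 1): the first row's value is copied to the whole group
by_match <- edge %>%
  group_by(id) %>%
  mutate(text = text[match("second", time, nomatch = 1)]) %>%
  ungroup()

# coalesce(): the lookup is NA, so every row keeps its own value
by_coalesce <- edge %>%
  group_by(id) %>%
  mutate(text = coalesce(text[match("second", time)], text)) %>%
  ungroup()

by_match$text      # "male" "male"
by_coalesce$text   # "male" "female"
```

Which behavior is preferable depends on how records without a "second" data point should be treated, which the question does not specify.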
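Using {dplyr}, we can take the following approach:
- we group_by(id)
- inside ifelse we check whether text has any element where time == "second"; for this we use length
- if it does, we use text[time == "second"], otherwise we use text
I just wonder what should happen if there are three data points where first and second are the same and third is different. The approach described above would not work in that case.
Also, what if first is "male", second is "female", and third is "male" again? Which one should be chosen?
The approach below simply uses second (if available) and ignores the rest: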
library(dplyr)
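df %>%
  group_by(id) %>%
  # the test is a single TRUE/FALSE per group, so ifelse() returns one value,
  # which mutate() recycles across the group's rows
  # (with no "second" row, that value is the group's first text entry)
  mutate(text = ifelse(length(text[time == "second"]) > 0,
                       text[time == "second"],
                       text))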
#> # A tibble: 11 × 3
#> # Groups: id [5]
#> id text time
#> <dbl> <chr> <chr>
#> 1 1 male first
#> 2 1 male second
#> 3 1 male third
#> 4 2 female first
#> 5 2 female second
#> 6 2 female third
#> 7 3 female first
#> 8 3 female second
#> 9 4 male second
#> 10 4 male third
#> 11 5 female first
Created on 2022-09-15 by the reprex package (v0.3.0)
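As a rough check of the two cases raised in this answer, the same ifelse approach can be run on made-up records. The ids 7 and 8 below are hypothetical and only illustrate what "second wins" does; whether that is the right rule for the real data is a separate question.

```r
library(dplyr)

# Hypothetical students illustrating the two cases raised above
edge <- data.frame(
  id   = c(7, 7, 7, 8, 8, 8),
  text = c("male", "male", "female",    # first == second, third differs
           "male", "female", "male"),   # second differs from first and third
  time = rep(c("first", "second", "third"), times = 2)
)

fixed <- edge %>%
  group_by(id) %>%
  mutate(text = ifelse(length(text[time == "second"]) > 0,
                       text[time == "second"],
                       text)) %>%
  ungroup()

fixed$text
# id 7: "male" on all three rows; id 8: "female" on all three rows
```

In both cases the value recorded at "second" overrides the others, even for id 8, where it is the minority entry.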