我有一个数据集,对于一些id,每个id有多行,但并不都像这个
Id Date Col2 Col3 Col4
1 2011-09-08 Red NA NA
2 2010-07-13 Blue NA NA
3 1997-04-23 Yellow NA NA
3 1997-04-23 Black NA NA
3 1997-04-23 Green NA Green
4 2020-01-27 Red NA NA
4 2020-01-27 Blue NA NA
5 1999-08-03 Red NA NA
5 1999-08-03 Blue NA Yellow
对于具有多行的ID,我喜欢只保留Col2
中的值等于Col4
中的值的行。如果ID有多行,并且Col2
中的值不等于Col4
中的值,则过滤器不适用。期望下面有这样的最终数据集
Id Date Col2 Col3 Col4
1 2011-09-08 Red NA NA
2 2010-07-13 Blue NA NA
3 1997-04-23 Green NA Green
4 2020-01-27 Red NA NA
4 2020-01-27 Blue NA NA
5 1999-08-03 Red NA NA
5 1999-08-03 Blue NA Yellow
到目前为止,我所做的就是这个
df_final <- df1 %>%
group_by(ID)
filter(Col2 == Col4, negate(is.na))
但这不起作用,任何关于这方面的建议都将不胜感激。
由于dplyr::filter
使用&
连接多个语句,这不是我们想要的,也不确定您要用negate(is.na)
做什么。然而,我们可以进行比较,看看是否有Col2 == Col4
,如果没有,则检查没有出现行的整个组。在进行比较时,我刚刚用""
替换了Col4
中的NA
值,因此比较返回T
和F
,但您也可以添加一个更复杂的检查,以检查两者是否为NA
。
我还创建了要单独筛选的列,以避免在单个filter
调用中两次在列之间进行比较。
df1 %>%
mutate(check = Col2 == replace_na(Col4, "")) %>%
group_by(Id) %>%
filter(check | !any(check))
#> # A tibble: 7 × 6
#> # Groups: Id [5]
#> Id Date Col2 Col3 Col4 check
#> <dbl> <date> <chr> <lgl> <chr> <lgl>
#> 1 1 2011-09-08 Red NA <NA> FALSE
#> 2 2 2010-07-13 Blue NA <NA> FALSE
#> 3 3 1997-04-23 Green NA Green TRUE
#> 4 4 2020-01-27 Red NA <NA> FALSE
#> 5 4 2020-01-27 Blue NA <NA> FALSE
#> 6 5 1999-08-03 Red NA <NA> FALSE
#> 7 5 1999-08-03 Blue NA Yellow FALSE
数据
structure(list(Id = c(1, 2, 3, 3, 3, 4, 4, 5, 5), Date = structure(c(15225,
14803, 9974, 9974, 9974, 18288, 18288, 10806, 10806), class = "Date"),
Col2 = c("Red", "Blue", "Yellow", "Black", "Green", "Red",
"Blue", "Red", "Blue"), Col3 = c(NA, NA, NA, NA, NA, NA,
NA, NA, NA), Col4 = c(NA, NA, NA, NA, "Green", NA, NA, NA,
"Yellow")), problems = structure(list(row = 7L, col = NA_character_,
expected = "5 columns", actual = "6 columns", file = "literal data"), row.names = c(NA,
-1L), class = c("tbl_df", "tbl", "data.frame")), class = c("spec_tbl_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(Id = structure(list(), class = c("collector_double",
"collector")), Date = structure(list(format = ""), class = c("collector_date",
"collector")), Col2 = structure(list(), class = c("collector_character",
"collector")), Col3 = structure(list(), class = c("collector_logical",
"collector")), Col4 = structure(list(), class = c("collector_character",
"collector"))), default = structure(list(), class = c("collector_guess",
"collector")), skip = 1L), class = "col_spec"))