r过滤器基于两列并且不丢失



我有一个数据集,对于一些id,每个id有多行,但并不都像这个

Id    Date       Col2   Col3    Col4
1     2011-09-08 Red    NA      NA
2     2010-07-13 Blue   NA      NA
3     1997-04-23 Yellow NA      NA
3     1997-04-23 Black  NA      NA
3     1997-04-23 Green  NA      Green
4     2020-01-27 Red    NA      NA
4     2020-01-27 Blue   NA      NA 
5     1999-08-03 Red    NA      NA
5     1999-08-03 Blue   NA      Yellow

对于具有多行的ID,我喜欢只保留Col2中的值等于Col4中的值的行。如果ID有多行,并且Col2中的值不等于Col4中的值,则过滤器不适用。期望下面有这样的最终数据集

Id    Date       Col2   Col3    Col4
1     2011-09-08 Red    NA      NA
2     2010-07-13 Blue   NA      NA

3     1997-04-23 Green  NA      Green

4     2020-01-27 Red    NA      NA
4     2020-01-27 Blue   NA      NA 
5     1999-08-03 Red    NA      NA
5     1999-08-03 Blue   NA      Yellow

到目前为止,我所做的就是这个

df_final <- df1 %>% 
group_by(ID)
filter(Col2 == Col4, negate(is.na))

但这不起作用,任何关于这方面的建议都将不胜感激。

由于dplyr::filter使用&连接多个语句,这不是我们想要的,也不确定您要用negate(is.na)做什么。然而,我们可以进行比较,看看是否有Col2 == Col4,如果没有,则检查没有出现行的整个组。在进行比较时,我刚刚用""替换了Col4中的NA值,因此比较返回TF,但您也可以添加一个更复杂的检查,以检查两者是否为NA

我还创建了要单独筛选的列,以避免在单个filter调用中两次在列之间进行比较。

df1 %>% 
mutate(check = Col2 == replace_na(Col4, "")) %>%
group_by(Id) %>%
filter(check | !any(check))
#> # A tibble: 7 × 6
#> # Groups:   Id [5]
#>      Id Date       Col2  Col3  Col4   check
#>   <dbl> <date>     <chr> <lgl> <chr>  <lgl>
#> 1     1 2011-09-08 Red   NA    <NA>   FALSE
#> 2     2 2010-07-13 Blue  NA    <NA>   FALSE
#> 3     3 1997-04-23 Green NA    Green  TRUE 
#> 4     4 2020-01-27 Red   NA    <NA>   FALSE
#> 5     4 2020-01-27 Blue  NA    <NA>   FALSE
#> 6     5 1999-08-03 Red   NA    <NA>   FALSE
#> 7     5 1999-08-03 Blue  NA    Yellow FALSE

数据

structure(list(Id = c(1, 2, 3, 3, 3, 4, 4, 5, 5), Date = structure(c(15225, 
14803, 9974, 9974, 9974, 18288, 18288, 10806, 10806), class = "Date"), 
Col2 = c("Red", "Blue", "Yellow", "Black", "Green", "Red", 
"Blue", "Red", "Blue"), Col3 = c(NA, NA, NA, NA, NA, NA, 
NA, NA, NA), Col4 = c(NA, NA, NA, NA, "Green", NA, NA, NA, 
"Yellow")), problems = structure(list(row = 7L, col = NA_character_, 
expected = "5 columns", actual = "6 columns", file = "literal data"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame")), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), spec = structure(list(
cols = list(Id = structure(list(), class = c("collector_double", 
"collector")), Date = structure(list(format = ""), class = c("collector_date", 
"collector")), Col2 = structure(list(), class = c("collector_character", 
"collector")), Col3 = structure(list(), class = c("collector_logical", 
"collector")), Col4 = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"))

最新更新