R - Filter a data frame when column name-value pairs are stored in a list?



I have a data frame like this:

library(dplyr)  # provides %>% and filter()
df <- tibble::rownames_to_column(USArrests, "State") %>%
  tidyr::pivot_longer(cols = -State)
head(df)
# A tibble: 6 x 3
State   name     value
<chr>   <chr>    <dbl>
1 Alabama Murder    13.2
2 Alabama Assault  236  
3 Alabama UrbanPop  58  
4 Alabama Rape      21.2
5 Alaska  Murder    10  
6 Alaska  Assault  263  

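In a separate list object l I have column-value pairs that I need to remove from the data frame. The element names are the column names, and the values correspond to the rows I want to remove: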

l <- list(State = c("Alabama", "Pennsylvania", "Texas"),
name = c("Murder", "Assault"))

Hard-coding it would look like this:

dplyr::filter(df, !State %in% c("Alabama", "Pennsylvania", "Texas"), !name %in% c("Murder", "Assault"))
State      name     value
<chr>      <chr>    <dbl>
1 Alaska     UrbanPop  48  
2 Alaska     Rape      44.5
3 Arizona    UrbanPop  80  
4 Arizona    Rape      31  
5 Arkansas   UrbanPop  50  
6 Arkansas   Rape      19.5
7 California UrbanPop  91  
8 California Rape      40.6
9 Colorado   UrbanPop  78  
10 Colorado   Rape      38.7
# ... with 84 more rows

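However, l changes frequently, so I can't (and don't want to) hard-code it. I tried the following, but only the last expression gets evaluated: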

library(purrr)
filter_expr <- imap_chr(l, ~ paste0("! ",
                                    .y,
                                    " %in% c(\"",
                                    paste(.x, collapse = "\",\""),
                                    "\")")) %>% parse(text = .)
filter(df, eval(filter_expr))
State      name     value
<chr>      <chr>    <dbl>
1 Alabama    UrbanPop  58  
2 Alabama    Rape      21.2
3 Alaska     UrbanPop  48  
4 Alaska     Rape      44.5
5 Arizona    UrbanPop  80  
6 Arizona    Rape      31  
7 Arkansas   UrbanPop  50  
8 Arkansas   Rape      19.5
9 California UrbanPop  91  
10 California Rape      40.6
# ... with 90 more rows
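
The behaviour above follows from how eval() treats an expression vector: every element is evaluated in turn, but only the value of the last one is returned. A minimal base R sketch of this, using a small hypothetical data frame (toy) and two hand-written conditions (conds) in place of df and filter_expr:

```r
# Toy data standing in for the question's df (hypothetical values).
toy <- data.frame(
  State = c("Alabama", "Alabama", "Alaska", "Alaska"),
  name  = c("Murder", "Rape", "Murder", "Rape"),
  stringsAsFactors = FALSE
)

# Two parsed conditions, analogous to filter_expr in the question.
conds <- parse(text = c('!State %in% c("Alabama")',
                        '!name %in% c("Murder")'))

# eval() walks through both expressions but returns only the last result,
# so the State condition is silently dropped.
only_last <- eval(conds, envir = toy)

# Evaluating each expression separately and combining with `&`
# applies every condition.
all_conds <- Reduce(`&`, lapply(conds, eval, envir = toy))

toy[only_last, ]  # rows where name is not "Murder"; Alabama is still present
toy[all_conds, ]  # only the Alaska / Rape row
```

This is meant only to illustrate the cause; the answers below avoid building and parsing code strings altogether.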

Is there a way to filter df, when the filter conditions are stored in a structure like l, that is more idiomatic to the tidyverse?

I gather that this answer goes in that direction; however, the expressions there are not dynamic.

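We can use across inside filter to loop over the names of 'l', create the logical expression by subsetting 'l' with the current column name (cur_column()) as the key, and negate it (!). Note that cur_column() currently works only with across, not with if_all/if_any (dplyr 1.0.6, R 4.1.0).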

library(dplyr)
df %>% 
filter(across(all_of(names(l)), ~ !. %in% l[[cur_column()]]))

-output

# A tibble: 94 x 3
#   State      name     value
#   <chr>      <chr>    <dbl>
# 1 Alaska     UrbanPop  48  
# 2 Alaska     Rape      44.5
# 3 Arizona    UrbanPop  80  
# 4 Arizona    Rape      31  
# 5 Arkansas   UrbanPop  50  
# 6 Arkansas   Rape      19.5
# 7 California UrbanPop  91  
# 8 California Rape      40.6
# 9 Colorado   UrbanPop  78  
#10 Colorado   Rape      38.7
# … with 84 more rows

If we can set an attribute, we can make use of if_all:

library(magrittr)
df %>% 
mutate(across(all_of(names(l)), ~ set_attr(., 'cn', cur_column()))) %>% 
filter(if_all(all_of(names(l)), ~ ! . %in% l[[attr(., 'cn')]]))

Or with imap/reduce:

library(purrr)
df %>%
filter(imap(l, ~ !cur_data()[[.y]] %in% .x) %>%
reduce(`&`))

Or another option is anti_join (note that this loop overwrites df):

for(nm in names(l)) df <- anti_join(df, tibble(!! nm := l[[nm]]))
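
If overwriting the original object is not wanted, the same idea can be written as a fold. The following is a minimal sketch, assuming dplyr and purrr are installed; toy and drop_list are hypothetical stand-ins for df and l, and the explicit by = nm is added only to silence the join message:

```r
library(dplyr)
library(purrr)

# Toy data standing in for the question's df (hypothetical values).
toy <- tibble(
  State = c("Alabama", "Alabama", "Alaska", "Alaska"),
  name  = c("Murder", "Rape", "Murder", "Rape"),
  value = c(13.2, 21.2, 10, 44.5)
)
drop_list <- list(State = "Alabama", name = "Murder")

# Fold over the column names, anti-joining one column at a time.
# Starting from .init = toy means the original object is never reassigned.
result <- reduce(
  names(drop_list),
  function(acc, nm) anti_join(acc, tibble(!!nm := drop_list[[nm]]), by = nm),
  .init = toy
)

result     # one row left: Alaska / Rape
nrow(toy)  # still 4
```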

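Here is another possible option: use purrr to create a logical vector. This allows both & and | conditions, and it gives access to the current column name (.y) without cur_column(), which can only be used inside across: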

df %>% 
filter(imap(l, ~ !df[[.y]] %in% .x) %>% reduce(`&`)) # can use magrittr::and

Output

State      name     value
<chr>      <chr>    <dbl>
1 Alaska     UrbanPop  48  
2 Alaska     Rape      44.5
3 Arizona    UrbanPop  80  
4 Arizona    Rape      31  
5 Arkansas   UrbanPop  50  
6 Arkansas   Rape      19.5
7 California UrbanPop  91  
8 California Rape      40.6
9 Colorado   UrbanPop  78  
10 Colorado   Rape      38.7
# ... with 84 more rows

The OR variant is:

df %>% 
filter(imap(l, ~ !df[[.y]] %in% .x) %>% reduce(`|`)) # can use magrittr::or
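
To make the difference between the `&` and `|` variants concrete, here is a base R sketch that counts the rows each one keeps. It rebuilds the long data from USArrests without the tidyverse; the names long, keep_and and keep_or are illustrative, and Map/Reduce stand in for imap/reduce:

```r
# Long version of USArrests built with base R only: 50 states x 4 measures.
long <- as.data.frame(as.table(as.matrix(USArrests)),
                      stringsAsFactors = FALSE, responseName = "value")
names(long) <- c("State", "name", "value")

l <- list(State = c("Alabama", "Pennsylvania", "Texas"),
          name  = c("Murder", "Assault"))

# One logical vector per list element: TRUE where the row is NOT listed.
not_listed <- Map(function(vals, nm) !long[[nm]] %in% vals, l, names(l))

# AND: keep rows that are unlisted in every column.
keep_and <- Reduce(`&`, not_listed)
# OR: keep rows that are unlisted in at least one column, i.e. drop only
# rows matching a listed State and a listed name at the same time.
keep_or <- Reduce(`|`, not_listed)

nrow(long)     # 200
sum(keep_and)  # 94  (47 states x 2 measures)
sum(keep_or)   # 194 (200 minus 3 states x 2 measures)
```

The AND count matches the 94 rows shown in the outputs above; the OR variant removes only the six rows where both a listed state and a listed measure occur together.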
