Filter all rows after a specific string


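I want to filter my data based on the string Code, wherever it appears in any column. The filter should keep the rows from the first occurrence of this word onward.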

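I am working with a list, and the column that contains Code changes position, so I need a generic way to do this, such as the across-based attempt below.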

Some sample data:

# A tibble: 11 × 10
`Title:`            `A&E weekly activity statistics, NHS…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
<chr>               <chr>                                  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 Contact:            Paul Steele - Unify2@dh.gsi.gov.uk     NA    NA    NA    NA    NA    NA    NA    NA   
2 NA                  NA                                     NA    NA    NA    NA    NA    NA    NA    NA   
3 Provider Level Data NA                                     NA    NA    NA    NA    NA    NA    NA    NA   
4 NA                  NA                                     NA    A&E … NA    NA    NA    A&E … NA    NA   
5 SHA                 Code                                   Name  Type… Type… Type… Tota… Type… Type… Type…
6 -                   -                                      Engl… 2831… 12906 1369… 4330… 15347 34    172  
7 NA                  NA                                     NA    NA    NA    NA    NA    NA    NA    NA   
8 Q30                 RLN                                    City… 1423  669   297   2389  202   0     18   
9 Q30                 RXP                                    Coun… 2473  0     2088  4561  89    0     0    
10 Q30                 5J9                                    Darl… 0     0     0     0     0     0     0    
11 Q30                 RR7                                    Gate… 1251  0     0     1251  24    0     0  

My attempt:

vf %>% filter(row_number() >= across(everything(), ~ .x %>% which(. == 'Code')))

I get these errors:

Error in `filter()`:
! Problem while computing `..1 = row_number() >= ...`.
Caused by error in `across()`:
! Problem while computing column `Title:`.
Caused by error in `which()`:
! argument to 'which' is not logical
Run `rlang::last_error()` to see where the error occurred.

Expected output:

# A tibble: 7 × 10
`Title:` `A&E weekly activity statistics, NHS and indepen…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
<chr>    <chr>                                              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 SHA      Code                                               Name  Type… Type… Type… Tota… Type… Type… Type…
2 -        -                                                  Engl… 2831… 12906 1369… 4330… 15347 34    172  
3 NA       NA                                                 NA    NA    NA    NA    NA    NA    NA    NA   
4 Q30      RLN                                                City… 1423  669   297   2389  202   0     18   
5 Q30      RXP                                                Coun… 2473  0     2088  4561  89    0     0    
6 Q30      5J9                                                Darl… 0     0     0     0     0     0     0    
7 Q30      RR7                                                Gate… 1251  0     0     1251  24    0     0    

Code to reproduce the data:

structure(list(`Title:` = c("Contact:", NA, "Provider Level Data", 
NA, "SHA", "-", NA, "Q30", "Q30", "Q30", "Q30"), `A&E weekly activity statistics, NHS and independent sector organisations in England` = c("Paul Steele - Unify2@dh.gsi.gov.uk", 
NA, NA, NA, "Code", "-", NA, "RLN", "RXP", "5J9", "RR7"), ...3 = c(NA, 
NA, NA, NA, "Name", "England", NA, "City Hospitals Sunderland NHS Foundation Trust", 
"County Durham And Darlington NHS Foundation Trust", "Darlington PCT", 
"Gateshead Health NHS Foundation Trust"), ...4 = c(NA, NA, NA, 
"A&E attendances", "Type 1 Departments - Major A&E", "283175", 
NA, "1423", "2473", "0", "1251"), ...5 = c(NA, NA, NA, NA, "Type 2 Departments - Single Specialty", 
"12906", NA, "669", "0", "0", "0"), ...6 = c(NA, NA, NA, NA, 
"Type 3 Departments - Other A&E/Minor Injury Unit", "136985", 
NA, "297", "2088", "0", "0"), ...7 = c(NA, NA, NA, NA, "Total attendances", 
"433066", NA, "2389", "4561", "0", "1251"), ...8 = c(NA, NA, 
NA, "A&E attendances > 4 hours from arrival to admission, transfer or discharge", 
"Type 1 Departments - Major A&E", "15347", NA, "202", "89", "0", 
"24"), ...9 = c(NA, NA, NA, NA, "Type 2 Departments - Single Specialty", 
"34", NA, "0", "0", "0", "0"), ...10 = c(NA, NA, NA, NA, "Type 3 Departments - Other A&E/Minor Injury Unit", 
"172", NA, "18", "0", "0", "0")), row.names = c(NA, -11L), class = c("tbl_df", 
"tbl", "data.frame"))

The following attempt seems to give what I want (keep comes from purrr):

data %>% filter(row_number() >= which(keep(.,~any(which(.=='Code'))) == 'Code'))

However, I am sure there is a cleaner approach than this!

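What I did:

  1. Filter for the rows whose row number is greater than or equal to the position at which Code is found in any column.
  2. keep retains only the column in which Code occurs, so that filter can carry on working.
  3. Output: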

# A tibble: 7 × 10
`Title:` `A&E weekly activity statistics, NHS and indepen…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
<chr>    <chr>                                              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 SHA      Code                                               Name  Type… Type… Type… Tota… Type… Type… Type…
2 -        -                                                  Engl… 2831… 12906 1369… 4330… 15347 34    172  
3 NA       NA                                                 NA    NA    NA    NA    NA    NA    NA    NA   
4 Q30      RLN                                                City… 1423  669   297   2389  202   0     18   
5 Q30      RXP                                                Coun… 2473  0     2088  4561  89    0     0    
6 Q30      5J9                                                Darl… 0     0     0     0     0     0     0    
7 Q30      RR7                                                Gate… 1251  0     0     1251  24    0     0   
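
The three steps above can be traced one at a time. The following is a minimal sketch on a made-up two-column tibble (toy and its values are assumptions for illustration, not the original data), and it uses any(..., na.rm = TRUE) as the predicate, a slight simplification of the any(which(...)) in the answer:

```r
library(dplyr)
library(purrr)

# Hypothetical toy data: "Code" sits in column b, row 2
toy <- tibble(a = c("x", "SHA", "Q30"), b = c(NA, "Code", "RLN"))

# keep() retains only the columns in which "Code" occurs
code_cols <- keep(toy, ~ any(.x == "Code", na.rm = TRUE))
names(code_cols)   # "b"

# which() ignores NA and returns the position of "Code" in that column
first_row <- which(code_cols == "Code")
first_row          # 2

# Keep the rows from that position onward
result <- filter(toy, row_number() >= first_row)
result
```

Note that this relies on exactly one column containing Code. With several matching columns, which() would return positions counted across the whole matrix rather than row numbers.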

The dplyr function cumany does the job for you.

It returns TRUE for every element from the first occurrence of "Code" onward.

df |> 
filter(if_any(everything(), ~ cumany(.x %in% "Code")))
`Title:` `A&E weekly activity statistics, NHS and independent sector organisations in England` ...3          ...4  ...5  ...6  ...7  ...8  ...9  ...10
<chr>    <chr>                                                                                 <chr>         <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 SHA      Code                                                                                  Name          Type… Type… Type… Tota… Type… Type… Type…
2 -        -                                                                                     England       2831… 12906 1369… 4330… 15347 34    172  
3 NA       NA                                                                                    NA            NA    NA    NA    NA    NA    NA    NA   
4 Q30      RLN                                                                                   City Hospita… 1423  669   297   2389  202   0     18   
5 Q30      RXP                                                                                   County Durha… 2473  0     2088  4561  89    0     0    
6 Q30      5J9                                                                                   Darlington P… 0     0     0     0     0     0     0    
7 Q30      RR7                                                                                   Gateshead He… 1251  0     0     1251  24    0     0 
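
To see what cumany does in isolation, here is a minimal sketch on a made-up vector (x is an assumption for illustration, not part of the original data). Using %in% rather than == matters here, because %in% never returns NA:

```r
library(dplyr)

# Hypothetical toy vector with missing values
x <- c(NA, "SHA", "Code", "-", NA, "RLN")

# %in% treats NA as "no match", so the result is strictly TRUE/FALSE
hit <- x %in% "Code"
hit          # FALSE FALSE  TRUE FALSE FALSE FALSE

# cumany() switches to TRUE at the first TRUE and stays TRUE
from_first <- cumany(hit)
from_first   # FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```

Inside filter, if_any then keeps a row as soon as any single column has reached its first "Code".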

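You can use str_detect to detect "Code" across all columns, combined with if_any to check whether any column contains the string, and then drop all rows before that row, like this (thanks to @Darren Tsai):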

library(stringr)
library(dplyr)
vf %>% 
filter(row_number() >= which.max(if_any(everything(), str_detect, "Code")))
#> # A tibble: 7 × 10
#>   `Title:` `A&E weekly activit…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
#>   <chr>    <chr>                 <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 SHA      Code                  Name  Type… Type… Type… Tota… Type… Type… Type…
#> 2 -        -                     Engl… 2831… 12906 1369… 4330… 15347 34    172  
#> 3 <NA>     <NA>                  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 4 Q30      RLN                   City… 1423  669   297   2389  202   0     18   
#> 5 Q30      RXP                   Coun… 2473  0     2088  4561  89    0     0    
#> 6 Q30      5J9                   Darl… 0     0     0     0     0     0     0    
#> 7 Q30      RR7                   Gate… 1251  0     0     1251  24    0     0

Created on 2022-07-01 by the reprex package (v2.0.1)
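
One thing to watch when applying this to a whole list: on a logical vector with no TRUE at all, which.max returns the position of the first FALSE (or a zero-length result if everything is NA), so a table without "Code" would not be flagged. The sketch below is one possible way around that, combining if_any with cumany. The helper name keep_from_first, the toy list sheets and the choice to return zero rows when nothing matches are all assumptions for illustration, not part of the answers:

```r
library(dplyr)
library(stringr)

# Hypothetical helper: keep rows from the first row in which any column
# matches `pattern`; returns zero rows if there is no match at all
keep_from_first <- function(df, pattern = "Code") {
  df %>%
    filter(cumany(
      # %in% TRUE turns the NA results of str_detect() into FALSE
      if_any(everything(), ~ str_detect(.x, pattern)) %in% TRUE
    ))
}

# Hypothetical list where "Code" sits in a different column each time
sheets <- list(
  one  = tibble(a = c("junk", NA, "SHA", "Q30"), b = c(NA, NA, "Code", "RLN")),
  two  = tibble(a = c(NA, "Code", "RLN"), b = c("junk", "Name", "City")),
  none = tibble(a = c("x", NA), b = c(NA, "y"))
)

trimmed <- lapply(sheets, keep_from_first)
sapply(trimmed, nrow)
```

Because str_detect matches patterns, this version also matches cells that merely contain "Code"; use .x %in% pattern instead if only exact matches should count.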

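Old answer:

You can use str_detect to detect "Code" across all columns and drop all rows before that row, as follows. Note that the output below has 6 rows rather than the expected 7, because the all-NA row is dropped as well: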

library(stringr)
library(dplyr)
vf %>% 
filter(row_number() > if_all(everything(), ~!str_detect(., "Code")))
#> # A tibble: 6 × 10
#>   `Title:` `A&E weekly activit…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
#>   <chr>    <chr>                 <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 SHA      Code                  Name  Type… Type… Type… Tota… Type… Type… Type…
#> 2 -        -                     Engl… 2831… 12906 1369… 4330… 15347 34    172  
#> 3 Q30      RLN                   City… 1423  669   297   2389  202   0     18   
#> 4 Q30      RXP                   Coun… 2473  0     2088  4561  89    0     0    
#> 5 Q30      5J9                   Darl… 0     0     0     0     0     0     0    
#> 6 Q30      RR7                   Gate… 1251  0     0     1251  24    0     0

Created on 2022-07-01 by the reprex package (v2.0.1)
