R: extract uppercase rows and fill down until the next uppercase row



I have some data that looks like:

RegionName
<chr>     
1 ANDALUCÍA 
2 Almería   
3 Abla      
4 Abrucena  
5 Adra      
6 ALBÁNCHEZ 
7 Alboloduy 
8 Albox     
9 ALCOLEA   
10 Alcóntar

Some of the rows are uppercase. I want to extract the uppercase rows into a new column and fill() down until the next uppercase row.

Expected output:

RegionName REGIONNAME
<chr>        <chr>
1 ANDALUCÍA   ANDALUCÍA   -first result
2 Almería     ANDALUCÍA
3 Abla        ANDALUCÍA
4 Abrucena    ANDALUCÍA
5 Adra        ANDALUCÍA
6 ALBÁNCHEZ   ALBÁNCHEZ  - change here
7 Alboloduy   ALBÁNCHEZ
8 Albox       ALBÁNCHEZ
9 ALCOLEA     ALCOLEA    - change here
10 Alcóntar    ALCOLEA

Data:

data = structure(list(RegionName = c("ANDALUCÍA", "Almería", "Abla", 
"Abrucena", "Adra", "ALBÁNCHEZ", "Alboloduy", "Albox", "ALCOLEA", 
"Alcóntar")), row.names = c(NA, -10L), class = c("tbl_df", "tbl", 
"data.frame"))

One idea is to use grepl() to identify the [[:upper:]] rows, convert the others to NA, and then fill(), i.e.

library(dplyr)
library(tidyr)
data %>% 
  # keep all-uppercase names, set everything else to NA
  mutate(new = replace(RegionName, !grepl("^[[:upper:]]+$", RegionName), NA)) %>% 
  # carry the last non-NA value downward
  fill(new)
# A tibble: 10 x 2
RegionName new      
<chr>      <chr>    
1 ANDALUCÍA  ANDALUCÍA
2 Almería    ANDALUCÍA
3 Abla       ANDALUCÍA
4 Abrucena   ANDALUCÍA
5 Adra       ANDALUCÍA
6 ALBÁNCHEZ  ALBÁNCHEZ
7 Alboloduy  ALBÁNCHEZ
8 Albox      ALBÁNCHEZ
9 ALCOLEA    ALCOLEA  
10 Alcóntar   ALCOLEA 
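
One caveat with the pattern above: "^[[:upper:]]+$" only matches names made up entirely of uppercase letters, so a name containing a space would not be picked up. The sketch below uses hypothetical names (not from the question's data) to compare the regex test with a plain comparison against toupper():

```r
# Hypothetical region/town names, used only for illustration
x <- c("LA RIOJA", "Haro", "MADRID", "Getafe")

# Regex test: every character must be an uppercase letter,
# so the space in "LA RIOJA" makes it fail
by_regex <- grepl("^[[:upper:]]+$", x)

# Comparison test: the name is unchanged by toupper(),
# which ignores spaces and punctuation
by_compare <- x == toupper(x)

print(data.frame(x, by_regex, by_compare))
```

If the real data contains multi-word region names, the comparison test is probably the safer of the two.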

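You can group the rows based on whether a region's name == its uppercase version: each all-uppercase row starts a new group. Then set every name in a group to the first RegionName of that group, which is the all-uppercase one.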

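library(tidyverse)
df %>%
  # the running count of uppercase rows gives each block its own group id
  group_by(grp = cumsum(RegionName == toupper(RegionName))) %>%
  # the first row of each group is the uppercase region name
  mutate(REGIONNAME = first(RegionName))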

Output

RegionName   grp REGIONNAME
<chr>      <int> <chr>     
1 ANDALUCÍA      1 ANDALUCÍA 
2 Almería        1 ANDALUCÍA 
3 Abla           1 ANDALUCÍA 
4 Abrucena       1 ANDALUCÍA 
5 Adra           1 ANDALUCÍA 
6 ALBÁNCHEZ      2 ALBÁNCHEZ 
7 Alboloduy      2 ALBÁNCHEZ 
8 Albox          2 ALBÁNCHEZ 
9 ALCOLEA        3 ALCOLEA   
10 Alcóntar       3 ALCOLEA 
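
The same cumsum() idea can also be sketched in base R without any packages. This is only an illustration, and it assumes the first row is an uppercase region name (otherwise the running count starts at 0 and the indexing no longer lines up):

```r
# Same data as in the question, rebuilt here so the sketch is self-contained
df <- data.frame(
  RegionName = c("ANDALUCÍA", "Almería", "Abla", "Abrucena", "Adra",
                 "ALBÁNCHEZ", "Alboloduy", "Albox", "ALCOLEA", "Alcóntar"),
  stringsAsFactors = FALSE
)

# TRUE for rows whose name is already all uppercase
is_upper <- df$RegionName == toupper(df$RegionName)

# Assumption: the data starts with an uppercase row
stopifnot(is_upper[1])

# cumsum(is_upper) is the index of the most recent uppercase row,
# so it can be used to look up that row's name for every row
df$REGIONNAME <- df$RegionName[is_upper][cumsum(is_upper)]

print(df)
```

Unlike the group_by() version, this does not leave a helper grp column or a grouped data frame behind.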

Data

df <- structure(list(RegionName = c("ANDALUCÍA", "Almería", "Abla", 
"Abrucena", "Adra", "ALBÁNCHEZ", "Alboloduy", "Albox", "ALCOLEA", 
"Alcóntar")), class = "data.frame", row.names = c("1", "2", 
"3", "4", "5", "6", "7", "8", "9", "10"))

An alternative with ifelse() and fill():

library(tidyverse)
df %>% 
  # keep uppercase names, set everything else to NA
  mutate(REGIONNAME = ifelse(RegionName == toupper(RegionName), RegionName, NA)) %>% 
  # carry the last non-NA value downward
  fill(REGIONNAME)
RegionName REGIONNAME
1   ANDALUCÍA  ANDALUCÍA
2     Almería  ANDALUCÍA
3        Abla  ANDALUCÍA
4    Abrucena  ANDALUCÍA
5        Adra  ANDALUCÍA
6   ALBÁNCHEZ  ALBÁNCHEZ
7   Alboloduy  ALBÁNCHEZ
8       Albox  ALBÁNCHEZ
9     ALCOLEA    ALCOLEA
10   Alcóntar    ALCOLEA
