R - How do I create a function that renames columns according to a specified dictionary?



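I am creating a data analysis package that scrapes data from a variety of sources. In many cases, columns holding the same type of data are named inconsistently from one data source to the next.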

In my case, I would like to create a rename function that looks up a dictionary formatted as a tbl_df and renames the columns accordingly.

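The example below works when every original column has a new name in the dictionary, but it fails when the tbl_df contains additional columns that are not in the dictionary.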

library(tidyverse)
df1 <- tibble::tribble(
~name, ~birthday, ~height,
"John Smith", "01/20/1990", "5'10"
)
df2 <- tibble::tribble(
~name, ~score, ~grade,
"John Smith", 95, 8
)

# column renaming dictionaries
people_dictionary <- tibble::tribble(
~namePeople, ~nameActual,
"name", "studentName",
"birthday", "dob",
"height", "height"
)
test_dictionary <- tibble::tribble(
~nameTest, ~nameActual,
"name", "studentName",
"score", "examScore",
"grade", "schoolGrade"
)
rename_function <- function(data, data_source = "people") {
# find dictionary based on data source
if (data_source == "people") {
actual_names_df <- people_dictionary
}
if (data_source == "test") {
actual_names_df <- test_dictionary
}
# get column names of data
original_names <- colnames(data)
# create column name filter depending on data source
name_columns <- case_when(
data_source == "people" ~ "namePeople",
data_source == "test" ~ "nameTest"
)
# Match Original Names to Renamed Column Names
actual_names <-
seq_along(original_names) %>%
purrr::map_chr(function(x) {
actual <-
actual_names_df %>%
# rlang used to unquote dynamic name column
filter((!!rlang::sym(name_columns)) == original_names[x]) %>%
.$nameActual
})
# rename columns
data <- data %>%
purrr::set_names(actual_names)
}
renamed_df1 <- df1 %>% rename_function(data_source = "people")
# original df
df1
#> # A tibble: 1 x 3
#>   name       birthday   height
#>   <chr>      <chr>      <chr> 
#> 1 John Smith 01/20/1990 5'10
# renamed columns of df1
renamed_df1
#> # A tibble: 1 x 3
#>   studentName dob        height
#>   <chr>       <chr>      <chr> 
#> 1 John Smith  01/20/1990 5'10
# additional column not named in dictionary
df3 <- tibble::tribble(
~name, ~birthday, ~height, ~weight,
"John Smith", "01/20/1990", "5'10", 165
)
df3 %>% rename_function(data_source = "people")
#> Error: Result 4 must be a single string, not a character vector of length 0

Created on 2020-02-06 by the reprex package (v0.3.0)

I think several parts of my rename function could be improved:

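  1. Is there a better way to look up the correct dictionary (tbl_df) than a pile of if statements? Could I create a tbl_df or CSV with one column for data_source and another column listing the name of the tbl_df in the package?
tibble::tribble(
~data_source,    ~dictionary_name,
"people", "people_dictionary",
"test",   "test_dictionary"
)
  2. How can I redesign my function to follow a similar workflow without a renaming error when not every column has a new name in the dictionary?
  3. Is there a better way to store "dictionaries" in an R package for my use case?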

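I think you can use a single dictionary for all imported datasets. One trick is to pass rename a named vector that maps the new variable names to the old ones. You can still keep the dictionary in a tibble and use tibble::deframe to create the named vector.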

Note that the named vector should have the form c(new1 = old1, new2 = old2, ..)

library(tidyverse)
# Our dictionary
dictionary <- tibble::tibble(
new = c("studentName", "dob", "height", "examScore", "schoolGrade"),
old = c("name", "birthday", "height", "score", "grade")
)
dictionary
#> # A tibble: 5 x 2
#>   new         old     
#>   <chr>       <chr>   
#> 1 studentName name    
#> 2 dob         birthday
#> 3 height      height  
#> 4 examScore   score   
#> 5 schoolGrade grade
# The data with a column that is not present in the dictionary
data <- tibble::tibble(
name = rnorm(5),
score = rnorm(5),
grade = rnorm(5),
new_var = rnorm(5)
)
data
#> # A tibble: 5 x 4
#>     name  score  grade new_var
#>    <dbl>  <dbl>  <dbl>   <dbl>
#> 1  0.678 -0.431  0.753  1.43  
#> 2  0.602 -0.532  1.15   0.356 
#> 3  1.68   1.40   0.410  0.0729
#> 4  0.817  1.84  -0.292  0.523 
#> 5 -0.316  0.954 -1.02   1.16

# get the data variable names where there is a new name in the dictionary
vars_found_in_dictionary <- intersect(names(data), unique(dictionary$old))
# create a temporary dictionary as a named vector
temp_dict <- dictionary %>% dplyr::filter(old %in% vars_found_in_dictionary) %>% tibble::deframe()
temp_dict
#> studentName   examScore schoolGrade 
#>      "name"     "score"     "grade"
# rename those variables only
data %>% dplyr::rename(!!temp_dict)
#> # A tibble: 5 x 4
#>   studentName examScore schoolGrade new_var
#>         <dbl>     <dbl>       <dbl>   <dbl>
#> 1       0.678    -0.431       0.753  1.43  
#> 2       0.602    -0.532       1.15   0.356 
#> 3       1.68      1.40        0.410  0.0729
#> 4       0.817     1.84       -0.292  0.523 
#> 5      -0.316     0.954      -1.02   1.16

Created on 2020-02-08 by the reprex package (v0.3.0)
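The intersect-and-filter steps above can probably be shortened. The following is a minimal sketch, assuming dplyr 1.0.0 or later, where rename accepts tidyselect helpers: dplyr::any_of() silently skips dictionary entries whose old name is not present in the data. The names lookup and renamed are illustrative only and do not come from the original answer.

```r
library(dplyr)

# Same dictionary as above, kept as a plain data frame for this sketch
dictionary <- data.frame(
  new = c("studentName", "dob", "height", "examScore", "schoolGrade"),
  old = c("name", "birthday", "height", "score", "grade"),
  stringsAsFactors = FALSE
)

# Data with one column (new_var) that is not in the dictionary
data <- data.frame(
  name = rnorm(5),
  score = rnorm(5),
  grade = rnorm(5),
  new_var = rnorm(5)
)

# Named vector of the form c(new = "old"), as rename expects
lookup <- stats::setNames(dictionary$old, dictionary$new)

# any_of() ignores lookup entries that have no matching column in `data`
renamed <- dplyr::rename(data, dplyr::any_of(lookup))
names(renamed)
```

With this approach the columns missing from the data (birthday, height) are skipped rather than raising an error, and new_var keeps its original name.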

Also, when you build the function, it is a good idea to handle the case where none of the variable names match the dictionary.
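One way such a check might look is sketched below. It uses base R name matching in place of dplyr::rename, so it is an alternative technique rather than the answer's own method. The function name rename_with_dictionary is hypothetical, and the dictionary is assumed to have character columns named new and old.

```r
# Hypothetical helper: rename only the columns listed in the dictionary.
# `dictionary` is assumed to have character columns `new` and `old`.
rename_with_dictionary <- function(data, dictionary) {
  # Position of each column name within dictionary$old (NA when absent)
  idx <- match(names(data), dictionary$old)
  found <- !is.na(idx)

  # Guard for the case where nothing matches the dictionary
  if (!any(found)) {
    warning("No column names matched the dictionary; returning data unchanged.")
    return(data)
  }

  # Replace only the matched names; unmatched columns keep their names
  names(data)[found] <- dictionary$new[idx[found]]
  data
}

dictionary <- data.frame(
  new = c("studentName", "dob", "height", "examScore", "schoolGrade"),
  old = c("name", "birthday", "height", "score", "grade"),
  stringsAsFactors = FALSE
)

# The data frame from the question that triggered the error (extra `weight` column)
df3 <- data.frame(
  name = "John Smith", birthday = "01/20/1990", height = "5'10", weight = 165,
  stringsAsFactors = FALSE
)

renamed_df3 <- rename_with_dictionary(df3, dictionary)
names(renamed_df3)
#> [1] "studentName" "dob"         "height"      "weight"
```

Whether a warning, an error, or a silent return is appropriate when nothing matches depends on how the package is meant to be used; the warning here is only one option.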

Edit

Since you want to select the dictionary:

library(tidyverse)
people <- tibble::tibble(
name = rnorm(5),
birthday = rnorm(5),
new_var = rnorm(5)
)
test <- tibble::tibble(
grade = rnorm(5),
score = rnorm(5)
)
people_dictionary <- tibble::tibble(
new = c("studentName", "dob", "height"),
old = c("name", "birthday", "height")
)

test_dictionary <- tibble::tibble(
new = c( "examScore", "schoolGrade"),
old = c("score", "grade")
)
dictionary_list <- 
list( "people" = "people_dictionary",
"test" =   "test_dictionary"
)

dictionary <- function(data) { 
dictionary <- deparse(substitute(data))
eval(parse(text= dictionary_list[dictionary][[1]]))
}

dictionary(people)
#> # A tibble: 3 x 2
#>   new         old     
#>   <chr>       <chr>   
#> 1 studentName name    
#> 2 dob         birthday
#> 3 height      height
dictionary(test)
#> # A tibble: 2 x 2
#>   new         old  
#>   <chr>       <chr>
#> 1 examScore   score
#> 2 schoolGrade grade

Created on 2020-02-08 by the reprex package (v0.3.0)
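If the dictionary should be chosen from a data_source string, as in the question, a named list that holds the dictionaries themselves may be simpler than eval(parse()) on a stored object name. The sketch below assumes that design; the names dictionaries and get_dictionary are hypothetical.

```r
people_dictionary <- data.frame(
  new = c("studentName", "dob", "height"),
  old = c("name", "birthday", "height"),
  stringsAsFactors = FALSE
)
test_dictionary <- data.frame(
  new = c("examScore", "schoolGrade"),
  old = c("score", "grade"),
  stringsAsFactors = FALSE
)

# Named list keyed by data source; it stores the dictionaries, not their names
dictionaries <- list(
  people = people_dictionary,
  test = test_dictionary
)

# Hypothetical lookup helper that fails loudly on an unknown data source
get_dictionary <- function(data_source) {
  if (!data_source %in% names(dictionaries)) {
    stop("Unknown data source: ", data_source)
  }
  dictionaries[[data_source]]
}

get_dictionary("test")
```

Adding a new source then only requires a new list entry, with no extra if statement, and an unrecognised source produces a clear error instead of an undefined variable.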
