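I am creating a data analysis package that pulls data from various sources. In many cases, the same kind of data has inconsistent column names across the data sources.
In my case, I want to create a rename function that looks up a dictionary stored as a tbl_df and renames the columns.
The example below works when every original column has a new name, but fails when the tbl_df contains additional columns that are not in the dictionary.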
library(tidyverse)
df1 <- tibble::tribble(
~name, ~birthday, ~height,
"John Smith", "01/20/1990", "5'10"
)
df2 <- tibble::tribble(
~name, ~score, ~grade,
"John Smith", 95, 8
)
# column renaming dictionaries
people_dictionary <- tibble::tribble(
~namePeople, ~nameActual,
"name", "studentName",
"birthday", "dob",
"height", "height"
)
test_dictionary <- tibble::tribble(
~nameTest, ~nameActual,
"name", "studentName",
"score", "examScore",
"grade", "schoolGrade"
)
rename_function <- function(data, data_source = "people") {
# find dictionary based on data source
if (data_source == "people") {
actual_names_df <- people_dictionary
}
if (data_source == "test") {
actual_names_df <- test_dictionary
}
# get column names of data
original_names <- colnames(data)
# create column name filter depending on data source
name_columns <- case_when(
data_source == "people" ~ "namePeople",
data_source == "test" ~ "nameTest"
)
# Match Original Names to Renamed Column Names
actual_names <-
seq_along(original_names) %>%
purrr::map_chr(function(x) {
actual <-
actual_names_df %>%
# rlang used to unquote dynamic name column
filter((!!rlang::sym(name_columns)) == original_names[x]) %>%
.$nameActual
})
# rename columns
data <- data %>%
purrr::set_names(actual_names)
}
renamed_df1 <- df1 %>% rename_function(data_source = "people")
# original df
df1
#> # A tibble: 1 x 3
#> name birthday height
#> <chr> <chr> <chr>
#> 1 John Smith 01/20/1990 5'10
# renamed columns of df1
renamed_df1
#> # A tibble: 1 x 3
#> studentName dob height
#> <chr> <chr> <chr>
#> 1 John Smith 01/20/1990 5'10
# additional column not named in dictionary
df3 <- tibble::tribble(
~name, ~birthday, ~height, ~weight,
"John Smith", "01/20/1990", "5'10", 165
)
df3 %>% rename_function(data_source = "people")
#> Error: Result 4 must be a single string, not a character vector of length 0
Created on 2020-02-06 by the reprex package (v0.3.0)
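I think a few parts of my rename function could be improved:
- Is there a better way to call the correct dictionary (tbl_df) than a long series of if statements? Could I create a tbl_df or csv with one column for data_source and another column listing the names of the tbl_df objects in the package?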
tibble::tribble(
~data_source, ~dictionary_name,
"people", "people_dictionary",
"test", "test_dictionary"
)
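- How can I redesign my function to follow a similar workflow without raising a rename error when some columns have no new name in the dictionary?
- For my use case, is there a better way to store the "dictionaries" in an R package?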
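I think you can use a single dictionary for all of the imported data sets. One trick is to pass rename a named vector of new and old variable names. You can still keep the dictionary in a tibble and use tibble::deframe to create the named vector.
Note that the named vector must have the form c(new1 = old1, new2 = old2, ...), with the new names as the vector names.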
library(tidyverse)
# Our dictionary
dictionary <- tibble::tibble(
new = c("studentName", "dob", "height", "examScore", "schoolGrade"),
old = c("name", "birthday", "height", "score", "grade")
)
dictionary
#> # A tibble: 5 x 2
#> new old
#> <chr> <chr>
#> 1 studentName name
#> 2 dob birthday
#> 3 height height
#> 4 examScore score
#> 5 schoolGrade grade
# The data with a column that is not present in the dictionary
data <- tibble::tibble(
name = rnorm(5),
score = rnorm(5),
grade = rnorm(5),
new_var = rnorm(5)
)
data
#> # A tibble: 5 x 4
#> name score grade new_var
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.678 -0.431 0.753 1.43
#> 2 0.602 -0.532 1.15 0.356
#> 3 1.68 1.40 0.410 0.0729
#> 4 0.817 1.84 -0.292 0.523
#> 5 -0.316 0.954 -1.02 1.16
# get the data variable names where there is a new name in the dictionary
vars_found_in_dictionary <- intersect(names(data), unique(dictionary$old))
# create a temporary dictionary as a named vector
temp_dict <- dictionary %>% dplyr::filter(old %in% vars_found_in_dictionary) %>% tibble::deframe()
temp_dict
#> studentName examScore schoolGrade
#> "name" "score" "grade"
# rename those variables only
data %>% dplyr::rename(!!temp_dict)
#> # A tibble: 5 x 4
#> studentName examScore schoolGrade new_var
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.678 -0.431 0.753 1.43
#> 2 0.602 -0.532 1.15 0.356
#> 3 1.68 1.40 0.410 0.0729
#> 4 0.817 1.84 -0.292 0.523
#> 5 -0.316 0.954 -1.02 1.16
Created on 2020-02-08 by the reprex package (v0.3.0)
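Also, when you wrap this in a function, it is a good idea to check for the case where none of the variable names match the dictionary.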
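The steps above, together with the "no match" check, could be wrapped up roughly as follows. This is only a sketch: the function name rename_from_dictionary is hypothetical, it assumes the dictionary has character columns named new and old with no duplicated old names, and it uses !!! to splice the named vector into dplyr::rename.

```r
library(dplyr)
library(tibble)

# Hypothetical helper (not part of any package): rename only the columns
# of `data` that appear in `dictionary$old`, leaving other columns as-is.
rename_from_dictionary <- function(data, dictionary) {
  # keep the dictionary rows whose old name is a column of `data`
  matched <- dictionary[dictionary$old %in% names(data), c("new", "old")]

  # nothing to rename: warn and return the data unchanged
  if (nrow(matched) == 0) {
    warning("No column names matched the dictionary; returning data unchanged.")
    return(data)
  }

  # named vector c(new = "old"), then splice it into rename()
  lookup <- tibble::deframe(matched)
  dplyr::rename(data, !!!lookup)
}

# Example data with a column ("weight") that is not in the dictionary
dictionary <- tibble::tibble(
  new = c("studentName", "dob", "height", "examScore", "schoolGrade"),
  old = c("name", "birthday", "height", "score", "grade")
)
df3 <- tibble::tribble(
  ~name, ~birthday, ~height, ~weight,
  "John Smith", "01/20/1990", "5'10", 165
)
renamed_df3 <- rename_from_dictionary(df3, dictionary)
names(renamed_df3)
```

With this approach the weight column is simply carried through under its original name instead of causing an error.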
Edit
Since you want to select the dictionary, here is one way to look it up from the name of the data object:
library(tidyverse)
people <- tibble::tibble(
name = rnorm(5),
birthday = rnorm(5),
new_var = rnorm(5)
)
test <- tibble::tibble(
grade = rnorm(5),
score = rnorm(5)
)
people_dictionary <- tibble::tibble(
new = c("studentName", "dob", "height"),
old = c("name", "birthday", "height")
)
test_dictionary <- tibble::tibble(
new = c( "examScore", "schoolGrade"),
old = c("score", "grade")
)
dictionary_list <-
list( "people" = "people_dictionary",
"test" = "test_dictionary"
)
dictionary <- function(data) {
dictionary <- deparse(substitute(data))
eval(parse(text= dictionary_list[dictionary][[1]]))
}
dictionary(people)
#> # A tibble: 3 x 2
#> new old
#> <chr> <chr>
#> 1 studentName name
#> 2 dob birthday
#> 3 height height
dictionary(test)
#> # A tibble: 2 x 2
#> new old
#> <chr> <chr>
#> 1 examScore score
#> 2 schoolGrade grade
Created on 2020-02-08 by the reprex package (v0.3.0)
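The lookup above depends on eval(parse()) and on the name of the object passed to the function. As an alternative sketch, the dictionaries themselves can be kept in a named list and selected by a data_source string, which also avoids a chain of if statements. The names dictionaries and get_dictionary are hypothetical and chosen only for illustration.

```r
library(tibble)

# Dictionaries stored directly in a named list, keyed by data source
dictionaries <- list(
  people = tibble::tibble(
    new = c("studentName", "dob", "height"),
    old = c("name", "birthday", "height")
  ),
  test = tibble::tibble(
    new = c("examScore", "schoolGrade"),
    old = c("score", "grade")
  )
)

# Hypothetical helper: return the dictionary for a given data source,
# failing early with a clear message if the source is unknown
get_dictionary <- function(data_source) {
  if (!data_source %in% names(dictionaries)) {
    stop("Unknown data source: ", data_source)
  }
  dictionaries[[data_source]]
}

test_dict <- get_dictionary("test")
test_dict
```

Inside a package, a list like this could be saved as internal data, for example with usethis::use_data(dictionaries, internal = TRUE), so that the package functions can use it without exporting it. Whether that suits your package layout is a judgment call.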