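In a data frame, I want to convert some values to NA. Which values should become NA depends on the column. This per-column specification of values is given in a separate list object. I want to write a function that accepts:
- the data frame to clean
- a vector specifying which columns to clean
- a list specifying which values are valid for each column
and returns a clean data frame in which the unwanted values have become NA. Although such a task could be done with a for loop, I'm trying to figure out whether there is a simpler way to iterate. I usually prefer tidyverse solutions, but I'm happy with any ideas.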
Example data
In the dataset below, each column has its own set of valid values that should be kept; all other values should become NA.
library(tibble)
set.seed(2020)
## generate random strings: https://stackoverflow.com/a/42734863/6105259
sample_strings <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(letters, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(letters, n, TRUE))
}
df <-
  tibble(id = 1:40,
         color = sample(c(1:5), size = 40, replace = TRUE),
         political = sample(c(sample(c("republican", "democrat", "green_party", "libertarian"), size = 20, replace = TRUE),
                              sample_strings(20))),
         religion_status = sample(c(sample(c("secular", "traditional", "religious", "atheist", "agnostic"), size = 20, replace = TRUE),
                                    sample_strings(20)))
  )
## # A tibble: 40 x 4
## id color political religion_status
## <int> <int> <chr> <chr>
## 1 1 4 republican fzwue3975k
## 2 2 4 republican mgxoe2220e
## 3 3 1 democrat secular
## 4 4 1 republican secular
## 5 5 4 aibcg6459y oqnfv1461t
## 6 6 2 aedqi0739y ufhua9648n
## 7 7 1 zgvox0771x agnostic
## 8 8 5 democrat traditional
## 9 9 2 republican atheist
## 10 10 2 oxgge5657l nktsl2136o
## # ... with 30 more rows
Which values should be kept in which column is defined in the following list:
var_mapping_list <- list(preferences = list(age = list(originType = "NumberQuestionPage",
originIndex = 6L, title = "what is your age?", valueDescriptions = NULL),
political = list(originType = "QuestionPage", originIndex = 7L,
title = "what is your political affiliation?", valueDescriptions = list(
republican = "I have voted most of my life to the republican party",
democrat = "I have voted most of my life to the democratic party",
other = "other")), religion_status = list(originType = "QuestionPage",
originIndex = 9L, title = "how do you define your religiousness level? ",
valueDescriptions = list(secular = "I don't practice any religion although I do belong to one",
traditional = "I'm observant and keep some of the practices",
religious = "I practice a religion", other = "other")),
color = list(originType = "QuestionPage", title = "which color do you like the best",
valueDescriptions = list(`1` = "red", `2` = "blue", `3` = "yellow",
`4` = "pink", `5` = "orange")), pet = list(originType = "QuestionPage",
originIndex = 0L, title = "do you have a pet? ", valueDescriptions = list(
yes = "yes", no = "no"))))
Example for a single variable
Say I want to clean df$political. To know which values to keep, I would first access:
var_mapping_list$preferences$political$valueDescriptions
## $republican
## [1] "I have voted most of my life to the republican party"
## $democrat
## [1] "I have voted most of my life to the democratic party"
## $other
## [1] "other"
My rule is that every option except other is a valid value for the corresponding column in df.
So in df$political, only republican and democrat should be kept, and everything else should become NA.
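The rule above (take the option keys, drop other) can be sketched as a small helper. This is only an illustration: valid_values and toy_mapping are hypothetical names, and toy_mapping is a trimmed-down stand-in with the same nesting as var_mapping_list.

```r
# Trimmed-down stand-in for var_mapping_list (hypothetical toy data)
toy_mapping <- list(preferences = list(
  political = list(valueDescriptions = list(
    republican = "votes republican",
    democrat = "votes democrat",
    other = "other"
  ))
))

# Hypothetical helper: option keys for one column, minus the catch-all "other"
valid_values <- function(mapping, column) {
  keys <- names(mapping$preferences[[column]]$valueDescriptions)
  setdiff(keys, "other")
}

valid_values(toy_mapping, "political")
## [1] "republican" "democrat"
```

With the real list, the same call would be valid_values(var_mapping_list, "political").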
A workflow for df$political alone would therefore be:
library(tidyr)
library(rlang)
library(dplyr)
vec_political_values_to_keep <-
  var_mapping_list$preferences$political$valueDescriptions %>%
  bind_rows %>%
  pivot_longer(cols = tidyselect::everything(),
               names_to = "option_key",
               values_to = "description") %>%
  filter(option_key != "other") %>%
  pull(option_key)
df %>%
  mutate(political = recode(political, !!!rlang::set_names(vec_political_values_to_keep), .default = NA_character_)) ## https://stackoverflow.com/a/63916563/6105259
## # A tibble: 40 x 4
## id color political religion_status
## <int> <int> <chr> <chr>
## 1 1 4 republican fzwue3975k
## 2 2 4 republican mgxoe2220e
## 3 3 1 democrat secular
## 4 4 1 republican secular
## 5 5 4 NA oqnfv1461t
## 6 6 2 NA ufhua9648n
## 7 7 1 NA agnostic
## 8 8 5 democrat traditional
## 9 9 2 republican atheist
## 10 10 2 NA nktsl2136o
I'd like to scale the above to any variables of interest in df.
Desired output
Given the vector
colnames_to_clean <- c("color", "political", "religion_status")
[1] "color" "political" "religion_status"
the following data frame should be returned:
## id color political religion_status
## <int> <int> <chr> <chr>
## 1 1 4 republican NA
## 2 2 4 republican NA
## 3 3 1 democrat secular
## 4 4 1 republican secular
## 5 5 4 NA NA
## 6 6 2 NA NA
## 7 7 1 NA NA
## 8 8 5 democrat traditional
## 9 9 2 republican NA
## 10 10 2 NA NA
I'd appreciate any help with this!
Here is one possibility. First, put the valid values into a tibble.
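library(purrr) # needed for map(); dplyr and tibble are loaded in the question
## one row per question: its name and the option keys it allows
new_list <- tibble(
  name = names(var_mapping_list$preferences),
  x = var_mapping_list$preferences
) %>%
  mutate(all_vals = map(x, ~ names(.x$valueDescriptions))) %>%
  select(-x)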
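The advantage is that you can now easily work with the valid values in the tidyverse. Second, join the valid values and check whether each current value is a valid one: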
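df %>%
  ## long format; mixing int and chr columns coerces every value to character
  gather(name, val, -id) %>%
  left_join(new_list, by = "name") %>%
  group_by(name) %>%
  ## keep a value only if it is one of the column's option keys other than "other"
  mutate(val = map2_chr(val, all_vals, ~if_else(.x %in% setdiff(.y, "other"), .x, NA_character_))) %>%
  select(-all_vals) %>%
  ## back to wide format, one column per variable
  spread(name, val)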
# A tibble: 40 x 4
id color political religion_status
<int> <chr> <chr> <chr>
1 1 4 republican NA
2 2 4 republican NA
3 3 1 democrat secular
4 4 1 republican secular
5 5 4 NA NA
6 6 2 NA NA
7 7 1 NA NA
8 8 5 democrat traditional
9 9 2 republican NA
10 10 2 NA NA
# ... with 30 more rows
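Note that in the output above color comes back as <chr>, because reshaping to long format puts integers and strings in one column. If the column types matter, a more compact alternative might be across() together with cur_column(). The following is a minimal sketch rather than a tested drop-in: it assumes dplyr >= 1.0.0, and clean_columns, toy_mapping and toy_df are hypothetical names with made-up toy data.

```r
library(dplyr)

# Hypothetical helper: in each listed column, keep only values whose key appears
# in that column's valueDescriptions (ignoring "other"); set the rest to NA.
clean_columns <- function(data, cols, mapping) {
  data %>%
    mutate(across(all_of(cols), function(col) {
      valid <- setdiff(names(mapping[[cur_column()]]$valueDescriptions), "other")
      # compare as character so integer codes such as 1:5 match the keys "1".."5"
      replace(col, !(as.character(col) %in% valid), NA)
    }))
}

# Toy data (hypothetical), shaped like var_mapping_list$preferences and df
toy_mapping <- list(
  color = list(valueDescriptions = list(`1` = "red", `2` = "blue")),
  political = list(valueDescriptions = list(
    republican = "votes republican",
    democrat = "votes democrat",
    other = "other"
  ))
)
toy_df <- data.frame(
  id = 1:3,
  color = c(1L, 7L, 2L),
  political = c("democrat", "abcde1234f", "other"),
  stringsAsFactors = FALSE
)

cleaned <- clean_columns(toy_df, c("color", "political"), toy_mapping)
# color stays integer: 1, NA, 2; political becomes "democrat", NA, NA
```

For the data in the question, the call would be clean_columns(df, colnames_to_clean, var_mapping_list$preferences). Because replace() writes NA into the existing vector, each column keeps its original type.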