r - Convert some data frame values to NA: the values to convert depend on the column and are given in a separate list



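In a data frame, I want to convert some values to NA. Which values should become NA depends on the column. This per-column specification of values is given in a separate list object. I want to write a function that accepts: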

  1. The data frame to clean
  2. A vector naming the columns to clean
  3. A list specifying which values are valid for each column

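and returns a clean data frame in which the unwanted values have become NA. Such a task could be done with a for loop, but I am trying to figure out whether there is a simpler way to iterate. I usually prefer tidyverse solutions, but I am happy with any idea.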

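Example data

In the dataset below, each column has its own set of valid values. These values should be kept, and all other values should become NA.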

library(tibble)
set.seed(2020)
## generate random strings: https://stackoverflow.com/a/42734863/6105259
sample_strings <- function(n = 5000) {
  a <- do.call(paste0, replicate(5, sample(letters, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(letters, n, TRUE))
}
df <-
  tibble(id = 1:40,
         color = sample(c(1:5), size = 40, replace = TRUE),
         political = sample(c(sample(c("republican", "democrat", "green_party", "libertarian"), size = 20, replace = TRUE),
                              sample_strings(20))),
         religion_status = sample(c(sample(c("secular", "traditional", "religious", "atheist", "agnostic"), size = 20, replace = TRUE),
                                    sample_strings(20)))
  )
## # A tibble: 40 x 4
##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican fzwue3975k     
##  2     2     4 republican mgxoe2220e     
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 aibcg6459y oqnfv1461t     
##  6     6     2 aedqi0739y ufhua9648n     
##  7     7     1 zgvox0771x agnostic       
##  8     8     5 democrat   traditional    
##  9     9     2 republican atheist        
## 10    10     2 oxgge5657l nktsl2136o     
## # ... with 30 more rows

Which values should be kept in which column is recorded in the following list:

var_mapping_list <- list(preferences = list(age = list(originType = "NumberQuestionPage", 
originIndex = 6L, title = "what is your age?", valueDescriptions = NULL), 
political = list(originType = "QuestionPage", originIndex = 7L, 
title = "what is your political affiliation?", valueDescriptions = list(
republican = "I have voted most of my life to the republican party", 
democrat = "I have voted most of my life to the democratic party", 
other = "other")), religion_status = list(originType = "QuestionPage", 
originIndex = 9L, title = "how do you define your religiousness level? ", 
valueDescriptions = list(secular = "I don't practice any religion although I do belong to one", 
traditional = "I'm observant and keep some of the practices", 
religious = "I practice a religion", other = "other")), 
color = list(originType = "QuestionPage", title = "which color do you like the best", 
valueDescriptions = list(`1` = "red", `2` = "blue", `3` = "yellow", 
`4` = "pink", `5` = "orange")), pet = list(originType = "QuestionPage", 
originIndex = 0L, title = "do you have a pet? ", valueDescriptions = list(
yes = "yes", no = "no"))))

Example for one variable

Say I want to clean df$political. To know which values to keep, I first look at:

var_mapping_list$preferences$political$valueDescriptions
## $republican
## [1] "I have voted most of my life to the republican party"
## $democrat
## [1] "I have voted most of my life to the democratic party"
## $other
## [1] "other"

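My rule is that every option except other is a valid value for the corresponding column in df.

This means that in df$political only republican and democrat should be kept, and everything else should become NA.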
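As a minimal sketch of this rule (separate from the workflow below), the valid keys can be read straight from the names of the valueDescriptions list. The names value_descriptions and valid_political are introduced here for illustration only, and the small list is a copy of the relevant part of var_mapping_list so that the snippet runs on its own.

```r
# Copy of var_mapping_list$preferences$political$valueDescriptions,
# repeated here so the snippet is self-contained
value_descriptions <- list(
  republican = "I have voted most of my life to the republican party",
  democrat = "I have voted most of my life to the democratic party",
  other = "other"
)

# Valid values are the option keys, minus the catch-all "other"
valid_political <- setdiff(names(value_descriptions), "other")
valid_political
# [1] "republican" "democrat"
```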

So a workflow example that handles only df$political is:

library(tidyr)
library(rlang)
library(dplyr)
vec_political_values_to_keep <-
  var_mapping_list$preferences$political$valueDescriptions %>%
  bind_rows %>%
  pivot_longer(cols = tidyselect::everything(),
               names_to = "option_key",
               values_to = "description") %>%
  filter(option_key != "other") %>%
  pull(option_key)

df %>%
  mutate(political = recode(political, !!!rlang::set_names(vec_political_values_to_keep), .default = NA_character_)) ## https://stackoverflow.com/a/63916563/6105259

## # A tibble: 40 x 4
##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican fzwue3975k     
##  2     2     4 republican mgxoe2220e     
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 NA         oqnfv1461t     
##  6     6     2 NA         ufhua9648n     
##  7     7     1 NA         agnostic       
##  8     8     5 democrat   traditional    
##  9     9     2 republican atheist        
## 10    10     2 NA         nktsl2136o  

I would like to scale the above to any variable of interest in df.

Desired output

Given the vector

colnames_to_clean <- c("color", "political", "religion_status")
## [1] "color"           "political"       "religion_status"

the following data frame should be returned:

##       id color political  religion_status
##    <int> <int> <chr>      <chr>          
##  1     1     4 republican NA             
##  2     2     4 republican NA             
##  3     3     1 democrat   secular        
##  4     4     1 republican secular        
##  5     5     4 NA         NA             
##  6     6     2 NA         NA             
##  7     7     1 NA         NA             
##  8     8     5 democrat   traditional    
##  9     9     2 republican NA             
## 10    10     2 NA         NA  

I would appreciate your help with this!

Here is one possibility. First, put the valid values into a tibble.

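library(purrr) # map2() and map2_chr() come from purrr

# one row per variable, with the option keys stored in a list column
new_list <- tibble(
  name = names(var_mapping_list$preferences),
  x = var_mapping_list$preferences
) %>%
  mutate(all_vals = map2(x, name, ~ names(.x$valueDescriptions))) %>%
  select(-x)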

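The advantage is that you can now work with the valid values easily within the tidyverse. Second, join the valid values to the data and check whether each current value is one of them. Note that gather() stacks all columns into a single character column, so color comes back as character rather than integer: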

df %>%
  gather(name, val, -id) %>%
  left_join(new_list, by = "name") %>%
  group_by(name) %>%
  mutate(val = map2_chr(val, all_vals, ~if_else(.x %in% setdiff(.y, "other"), .x, NA_character_))) %>%
  select(-all_vals) %>%
  spread(name, val)
## # A tibble: 40 x 4
##       id color political  religion_status
##    <int> <chr> <chr>      <chr>          
##  1     1 4     republican NA             
##  2     2 4     republican NA             
##  3     3 1     democrat   secular        
##  4     4 1     republican secular        
##  5     5 4     NA         NA             
##  6     6 2     NA         NA             
##  7     7 1     NA         NA             
##  8     8 5     democrat   traditional    
##  9     9 2     republican NA             
## 10    10 2     NA         NA             
## # ... with 30 more rows
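
The same rule can also be wrapped into a function of the shape asked for in the question (data frame, column vector, mapping list). The following is only a sketch under some assumptions: it relies on dplyr 1.0.0 or later for across() and cur_column(), the helper name clean_invalid_values and the toy objects toy_df and toy_mapping are made up here for illustration, and the mapping argument is assumed to have the same structure as var_mapping_list$preferences.

```r
library(dplyr)
library(tibble)

# Hypothetical helper (not from the original posts): for each column in `cols`,
# keep values that appear as option keys (other than "other") in `mapping`,
# and turn everything else into NA.
clean_invalid_values <- function(data, cols, mapping) {
  data %>%
    mutate(across(all_of(cols), function(x) {
      # Look up the valid keys for the column currently being processed
      valid <- setdiff(names(mapping[[cur_column()]]$valueDescriptions), "other")
      # Compare as character so integer-coded columns match the list names;
      # replace() keeps the original column type
      replace(x, !(as.character(x) %in% valid), NA)
    }))
}

# Toy data with the same structure as the question's objects (made up here)
toy_mapping <- list(
  color = list(valueDescriptions = list(`1` = "red", `2` = "blue")),
  political = list(valueDescriptions = list(
    republican = "r", democrat = "d", other = "other"
  ))
)
toy_df <- tibble(
  id = 1:4,
  color = c(1L, 2L, 7L, 1L),
  political = c("republican", "other", "abcde1234x", "democrat")
)

cleaned <- clean_invalid_values(toy_df, c("color", "political"), toy_mapping)
cleaned
```

With the objects from the question, the call would presumably be clean_invalid_values(df, colnames_to_clean, var_mapping_list$preferences). Because no reshaping is involved, color stays an integer column, unlike in the gather()/spread() approach.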
