r - How to fill in Yes and No when applying conditions across 3 columns using only dplyr



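I am trying to create a new column, say test, that depends on several conditions across 3 columns. I would like to do this using only the tidyverse. Here are my conditions:

  • If any one column has Yes, regardless of the other categories (No/Unknown/NA) for the same id across the columns previous_cabg, previous_pci, previous_ami, then assign Yes to the test variable
  • If all columns have No for the same id, assign No to the test variable
  • If one column has No and the other columns have NA/Unknown for the same id, assign No to the test variable
  • If all columns have Yes for the same id, assign Yes to the test variable
  • If one column has Yes and the other columns have NA/Unknown for the same id, assign Yes to the test variable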

This is the kind of dataset I have:

structure(list(id = c(112139L, 43919L, 92430L, 87137L, 95417L, 
66955L, 16293L, 61396L, 25379L, 79229L, 27107L, 63243L, 50627L, 
17968L, 83015L, 96549L, 7332L, 4873L, 98131L, 93506L, 52894L, 
59327L, 85003L, 96623L, 82999L, 65769L, 67063L, 21744L, 62961L, 
2229L, 103673L, 9367L, 60215L, 74044L, 58422L, 57530L, 100399L, 
46483L, 108690L, 62017L, 46467L, 79562L, 4800L, 119158L, 103222L, 
32908L, 14491L, 30293L, 52558L, 122304L, 42281L, 1553L, 111771L, 
23087L, 30147L, 37842L, 51552L, 20148L, 28L, 7477L), previous_cabg = structure(c(1L, 
1L, 1L, NA, 1L, NA, NA, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, NA, 1L, 1L, NA, 1L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 3L, 
1L, 1L, NA, 1L, 1L, 1L, 1L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L), .Label = c("No", 
"Unknown", "Yes"), class = "factor"), previous_pci = structure(c(1L, 
1L, 2L, NA, 1L, NA, NA, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 
2L, NA, 2L, 1L, NA, 2L, NA, 1L, 2L, 1L, 1L, 1L, NA, 2L, 1L, 1L, 
2L, 2L, NA, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 2L, 1L, 1L), .Label = c("No", 
"Yes", "Unknown"), class = "factor"), previous_ami = structure(c(2L, 
2L, 1L, 2L, 2L, NA, 2L, 1L, 2L, 2L, NA, 1L, 2L, 2L, 2L, 2L, 2L, 
1L, NA, 1L, 2L, NA, 1L, NA, 2L, 1L, 2L, 2L, 2L, NA, 1L, 1L, 1L, 
2L, 1L, NA, 2L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 3L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, NA, 2L, 2L, 2L, 1L, 2L), .Label = c("Yes", 
"No", "Unknown"), class = "factor")), row.names = c(NA, -60L), problems = structure(list(
row = c(34136L, 121773L, 121779L), col = c("1.01 Hospital identifier", 
"1.01 Hospital identifier", "1.01 Hospital identifier"), 
expected = c("value in level set", "value in level set", 
"value in level set"), actual = c("CMH", "CMH", "CMH"), file = c("'../../data/changed/minap_2020_2021_second.csv'", 
"'../../data/changed/minap_2020_2021_second.csv'", "'../../data/changed/minap_2020_2021_second.csv'"
)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"
)), class = c("tbl_df", "tbl", "data.frame"))

This is what it looks like (first 10 rows only). If you look closely, the 3 columns contain different combinations of values:

# A tibble: 60 x 4
id previous_cabg previous_pci previous_ami
<int> <fct>         <fct>        <fct>       
1 112139 No            No           No          
2  43919 No            No           No          
3  92430 No            Yes          Yes         
4  87137 NA            NA           No          
5  95417 No            No           No          
6  66955 NA            NA           NA          
7  16293 NA            NA           No          
8  61396 No            Yes          Yes         
9  25379 No            Yes          No          
10  79229 No            No           No        

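I am hoping to solve this using only the tidyverse, or a combination of the tidyverse and base R.

This is what I have tried, but I do not think it is a sensible approach. The code will be part of an automated process, and later data extracts may contain categories other than Yes and No, such as Unknown. When that happens, I want the code to handle every case according to the conditions I listed above.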

dplyr::mutate(first_attack = 
dplyr::case_when(previous_cabg == 'No'  | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'No'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes', 
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~  'Yes', 
previous_cabg == 'No'  | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'No'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'No'  | previous_ami == 'Yes' ~ 'Yes', 
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~  'Yes',

# deal with the unknown category
previous_cabg == 'Unknown'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown'  | previous_ami == 'Yes' ~ 'Yes', 
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'No' ~  'Yes', 
previous_cabg == 'Unknown'  | previous_pci == 'Unknown'  | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Unknown'  | previous_pci == 'Yes' | previous_ami == 'Yes' ~ 'Yes',
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes', 
previous_cabg == 'Yes' | previous_pci == 'Yes' | previous_ami == 'Unknown' ~  'Yes', 


previous_cabg == 'Yes' |  previous_pci == 'No'  | previous_ami == 'Yes' ~  'Yes', 
previous_cabg == 'Yes' |  previous_pci == 'No'  | previous_ami == 'No'  ~  'Yes',
previous_cabg == 'No'  |  previous_pci == 'No'  | previous_ami == 'Yes' ~  'Yes',
previous_cabg == 'No'  | previous_pci == 'Yes'  | previous_ami == 'No'  ~ 'Yes', 


previous_cabg == 'Yes' |  previous_pci == 'Unknown'   | previous_ami == 'Yes' ~  'Yes', 
previous_cabg == 'Yes' |  previous_pci == 'Unknown'   | previous_ami == 'Unknown'   ~  'Yes',
previous_cabg == 'Unknown'   |  previous_pci == 'Unknown'   | previous_ami == 'Yes' ~  'Yes',
previous_cabg == 'Unknown'   | previous_pci == 'Yes'  | previous_ami == 'Unknown'   ~ 'Yes', 


previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Unknown' ~ 'Yes', 
previous_cabg == 'Unknown'  | previous_pci == 'Yes'| previous_ami == 'Unknown' ~ 'Yes', 
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes', 
previous_cabg == 'Unknown'  | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes', 

previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'No' ~ 'Yes', 
previous_cabg == 'No'  | previous_pci == 'Yes'| previous_ami == 'No' ~ 'Yes', 
previous_cabg == 'Yes' | previous_pci == 'No' | previous_ami == 'Yes' ~ 'Yes', 
previous_cabg == 'No'  | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes', 

previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Unknown' ~ 'Yes', 
previous_cabg == 'Unknown'  | previous_pci == 'Yes'| previous_ami == 'Unknown' ~ 'Yes', 
previous_cabg == 'Yes' | previous_pci == 'Unknown' | previous_ami == 'Yes' ~ 'Yes', 
previous_cabg == 'Unknown' | previous_pci == 'Yes'| previous_ami == 'Yes' ~ 'Yes', 


previous_cabg == 'No'  | previous_pci == 'No'  |  previous_ami == 'No' ~ 'No', 
previous_cabg == 'Yes' | previous_pci == 'Yes' |  previous_ami == 'Yes' ~'Yes'

))

These operations use rowwise(), so they are not very efficient, but this tidyverse solution should do what you want.

Let's call your example dataset `dataset`. Then the following workflow

library(tidyverse)

# ...
# Code to generate your 'dataset'.
# ...

# Define custom logic across a single row.
get_first_attack <- function(values_across_row) {
  if (isTRUE(any(values_across_row == "Yes"))) {
    # "Yes" overrides all other values.
    return("Yes")
  } else if (isTRUE(any(values_across_row == "No"))) {
    # "No" overrides all missing values: 'NA' and "Unknown".
    return("No")
  } else if (isTRUE(any(values_across_row == "Unknown"))) {
    # "Unknown" overrides all other missing values: 'NA'.
    return("Unknown")
  } else {
    # All values are missing: 'NA'.
    return(as.character(NA))
  }
}

dataset %>%
  # Examine row by row.
  dplyr::rowwise() %>%
  # Compare values across each row according to the logic in 'get_first_attack()'.
  dplyr::mutate(first_attack = get_first_attack(across(previous_cabg:previous_ami))) %>%
  # Exit row-wise approach, to restore efficiency.
  dplyr::ungroup() %>%
  # Factor 'first_attack' exactly like its neighboring column.
  dplyr::mutate(first_attack = factor(first_attack, levels = levels(previous_ami)))

should give you these results

# A tibble: 60 x 5
id previous_cabg previous_pci previous_ami first_attack
<int> <fct>         <fct>        <fct>        <fct>       
1 112139 No            No           No           No          
2  43919 No            No           No           No          
3  92430 No            Yes          Yes          Yes         
4  87137 NA            NA           No           No          
5  95417 No            No           No           No          
6  66955 NA            NA           NA           NA          
7  16293 NA            NA           No           No          
8  61396 No            Yes          Yes          Yes         
9  25379 No            Yes          No           Yes         
10  79229 No            No           No           No          
# ... with 50 more rows

where the first_attack column is properly defined as a factor with three levels: "Yes", "No", "Unknown".
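
If the row-wise step ever becomes a bottleneck, the same priority order (Yes, then No, then Unknown, otherwise NA) can probably be expressed with vectorised helpers instead. The following is only a minimal sketch, assuming dplyr 1.0.4 or later (for `if_any()`); the `toy` tibble and its values are hypothetical and made up for illustration, they are not the asker's data.

```r
library(dplyr)

# Hypothetical toy data, for illustration only (not the asker's dataset).
toy <- tibble(
  id            = 1:5,
  previous_cabg = factor(c("No", "No",  NA,   "Unknown", NA)),
  previous_pci  = factor(c("No", "Yes", NA,   NA,        NA)),
  previous_ami  = factor(c("No", "No",  "No", "Unknown", NA))
)

result <- toy %>%
  mutate(
    # case_when() takes the first branch that matches, so the order of
    # the branches encodes the priority: Yes > No > Unknown > NA.
    # `%in%` returns FALSE (not NA) for missing values.
    first_attack = case_when(
      if_any(previous_cabg:previous_ami, ~ .x %in% "Yes")     ~ "Yes",
      if_any(previous_cabg:previous_ami, ~ .x %in% "No")      ~ "No",
      if_any(previous_cabg:previous_ami, ~ .x %in% "Unknown") ~ "Unknown",
      TRUE                                                    ~ NA_character_
    )
  )
```

Because every comparison runs on whole columns, no `rowwise()`/`ungroup()` pair is needed. A category that appears in a later extract would fall through to NA unless a branch is added for it.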

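In summary, your conditions are:

  • For each row, if any column is "Yes", output "Yes"
  • For each row, if all columns are NA, output NA
  • For each row, if all columns are "Unknown", output "Unknown"
  • Otherwise, output "No"

If that is the case, you can do: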

# Convert your data structure into a data.frame
dat <- as.data.frame(dat)
# Remove id col
id <- dat$id
dat <- subset(dat, select = -c(id))
# For each row, check if there is a 'Yes' under any column. If so, return 'Yes'; otherwise return 'No'
output <- apply(dat, 1, function(x) if ('Yes' %in% x) 'Yes' else 'No')
# For each row, check whether every column is NA. If so, return TRUE; otherwise return FALSE.
isNA <- apply(dat, 1, function(x) all(is.na(x)))
# Now merge output and isNA
output[isNA] <- NA
# For each row, check whether every column is 'Unknown'. isTRUE() turns an NA result into FALSE.
isUK <- apply(dat, 1, function(x) isTRUE(all(x == 'Unknown')))
# Now merge output and isUK
output[isUK] <- 'Unknown'
# Append the output character vector to a new col of the data frame
dat$id <- id
dat$test <- output
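
The three `apply()` passes can likely be collapsed into whole-matrix comparisons. Below is a minimal base R sketch of the same four rules; the `toy` data frame is hypothetical and invented for illustration, and the helper names (`m`, `any_yes`, `all_na`, `all_unknown`) are not part of the original answer.

```r
# Hypothetical toy data, for illustration only (not the asker's dataset).
toy <- data.frame(
  id            = 1:5,
  previous_cabg = factor(c("No", "No",  NA, "Unknown", NA)),
  previous_pci  = factor(c("No", "Yes", NA, "Unknown", "Unknown")),
  previous_ami  = factor(c("No", "No",  NA, "Unknown", "No"))
)

# Character matrix of the three condition columns (factor labels, not codes).
m <- as.matrix(toy[c("previous_cabg", "previous_pci", "previous_ami")])

any_yes     <- rowSums(m == "Yes", na.rm = TRUE) > 0            # at least one "Yes"
all_na      <- rowSums(!is.na(m)) == 0                          # every column is NA
all_unknown <- rowSums(m == "Unknown", na.rm = TRUE) == ncol(m) # every column is "Unknown"

test <- rep("No", nrow(m))   # default rule: "No"
test[any_yes]     <- "Yes"
test[all_na]      <- NA
test[all_unknown] <- "Unknown"
toy$test <- test
```

This keeps the id column in place, so it does not need to be removed and appended again.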
