R: Determining redundant and unique values across a set of columns



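I want to determine when the values across a set of columns are redundant and record that in a new column, Multi?, where 0 means only one distinct value was seen and 1 means multiple distinct values were seen. When the value "Unspecified" appears alongside other values, I want the code to ignore it and evaluate the redundancy of the remaining values accordingly. When "Unspecified" is the only value in the set of columns, I want Multi? to record "Unspecified".

Note that these four columns are only part of a larger data set that has many more columns.

To illustrate what I mean, I have provided an example input and output below: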

  headbleed_type_dx1 headbleed_type_dx2 headbleed_type_dx3 headbleed_type_dx4
1      Intracerebral      Intracerebral      Intracerebral               <NA>
2      Intracerebral       Subarachnoid               <NA>           Subdural
3        Unspecified      Intracerebral           Subdural      Intracerebral
4        Unspecified               <NA>               <NA>               <NA>
5               <NA>               <NA>               <NA>               <NA>

If Multi? is 1 for a row, I would also like to record the number of unique values in a new column, Number:

Multi?       Number
1 0            1
2 1            3
3 1            2
4 Unspecified  1
5 NA           NA 

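This is really messy, and I would strongly advise against mixing numbers and characters in one column. That said, if you are open to a dplyr-based solution: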

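library(dplyr)

data %>%
  rowwise() %>%
  summarise(
    # distinct non-missing values in the row ("Unspecified" still counted here)
    number = n_distinct(
      c_across(headbleed_type_dx1:headbleed_type_dx4),
      na.rm = TRUE),
    # does the row contain "Unspecified"? any() gives NA when there is no
    # match but NAs are present, so coalesce() maps that to FALSE
    unspec = coalesce(
      any(c_across(headbleed_type_dx1:headbleed_type_dx4) == "Unspecified"),
      FALSE)) %>%
  mutate(
    # drop "Unspecified" from the count when other values exist;
    # a count of 0 (all NA) becomes NA
    number2 = if_else(number > 1L & unspec, number - 1L, na_if(number, 0)),
    multi = case_when(
      number == 1 & unspec ~ "Unspecified",  # "Unspecified" is the only value
      number2 == 1 ~ "0",                    # exactly one distinct value
      is.na(number2) ~ NA_character_,        # row is entirely NA
      TRUE ~ "1"),                           # more than one distinct value
    .keep = "none") %>%
  select(number = number2, multi)

This returns

# A tibble: 6 × 2
  number multi
   <int> <chr>
1      1 0
2      3 1
3      2 1
4      1 Unspecified
5     NA NA
6      1 0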

Data
data <- structure(list(headbleed_type_dx1 = c("Intracerebral", "Intracerebral", 
"Unspecified", "Unspecified", NA, "Intracerebral"), headbleed_type_dx2 = c("Intracerebral", 
"Subarachnoid", "Intracerebral", NA, NA, "Unspecified"), headbleed_type_dx3 = c("Intracerebral", 
NA, "Subdural", NA, NA, "Intracerebral"), headbleed_type_dx4 = c(NA, 
"Subdural", "Intracerebral", NA, NA, NA)), problems = structure(list(
row = 1:4, col = c(NA_character_, NA_character_, NA_character_, 
NA_character_), expected = c("4 columns", "4 columns", "4 columns", 
"4 columns"), actual = c("5 columns", "5 columns", "5 columns", 
"5 columns"), file = c("literal data", "literal data", "literal data", 
"literal data")), row.names = c(NA, -4L), class = c("tbl_df", 
"tbl", "data.frame")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -6L), spec = structure(list(
cols = list(headbleed_type_dx1 = structure(list(), class = c("collector_character", 
"collector")), headbleed_type_dx2 = structure(list(), class = c("collector_character", 
"collector")), headbleed_type_dx3 = structure(list(), class = c("collector_character", 
"collector")), headbleed_type_dx4 = structure(list(), class = c("collector_character", 
"collector"))), default = structure(list(), class = c("collector_guess", 
"collector")), skip = 1L), class = "col_spec"))
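
For readers who want to see the per-row rule in isolation, here is a minimal base R sketch of the same logic. It is not part of the original answer: `classify_row` and the small `df` below are hypothetical names made up for illustration, and the sketch assumes the diagnosis columns are character vectors.

```r
# Hypothetical helper: classify the values of a single row.
classify_row <- function(x, placeholder = "Unspecified") {
  vals <- unique(x[!is.na(x)])               # distinct non-missing values
  if (length(vals) == 0L) {                  # row is entirely NA
    return(data.frame(multi = NA_character_, number = NA_integer_,
                      stringsAsFactors = FALSE))
  }
  specific <- setdiff(vals, placeholder)     # ignore the placeholder
  if (length(specific) == 0L) {              # placeholder is the only value
    return(data.frame(multi = placeholder, number = 1L,
                      stringsAsFactors = FALSE))
  }
  data.frame(multi = if (length(specific) > 1L) "1" else "0",
             number = length(specific),
             stringsAsFactors = FALSE)
}

# Small illustrative input mirroring the five rows in the question.
df <- data.frame(
  headbleed_type_dx1 = c("Intracerebral", "Intracerebral", "Unspecified", "Unspecified", NA),
  headbleed_type_dx2 = c("Intracerebral", "Subarachnoid", "Intracerebral", NA, NA),
  headbleed_type_dx3 = c("Intracerebral", NA, "Subdural", NA, NA),
  headbleed_type_dx4 = c(NA, "Subdural", "Intracerebral", NA, NA),
  stringsAsFactors = FALSE
)

cols <- paste0("headbleed_type_dx", 1:4)

# Apply the helper row by row and stack the one-row results.
rows <- lapply(seq_len(nrow(df)), function(i) {
  classify_row(unlist(df[i, cols], use.names = FALSE))
})
result <- do.call(rbind, rows)
print(result)
```

Separating the rule into a plain function makes each of the three special cases (all NA, only "Unspecified", "Unspecified" mixed with other values) explicit, at the cost of being slower than a vectorised approach on large tables.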
