r - Delete duplicates based on multiple columns but select the "most" complete version of the duplicates by least NA's

I have a data frame that looks like this:

Month  Day  Year  Color  Weather  Location  Transporation  ID
Jan    Tue  2020  Blue   Warm     Hospital  NA             1
Jan    Tue  2020  Blue   Warm     NA        NA             1
Jan    Tue  2020  Blue   NA       NA        NA             1
Feb    Thu  2020  Red    NA       NA        NA             2
Feb    Thu  2020  Red    Warm     NA        NA             2
Feb    Thu  2020  Red    Warm     Garden    Run            2
Mar    Thu  2020  Red    Cold     Desk      Bus            3

I would like it to look like this:

Month  Day  Year  Color  Weather  Location  Transporation  ID
Jan    Tue  2020  Blue   Warm     Hospital  NA             1
Feb    Thu  2020  Red    Warm     Garden    Run            2
Mar    Thu  2020  Red    Cold     Desk      Bus            3

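Basically, I want to decide whether a row is a duplicate by looking at three columns, c(ID, Month, Color). Once the duplicates are identified, I want to remove the copies with the most NAs, i.e. the "least complete" ones.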

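Maybe this would work. I use rowSums(is.na()) to count the number of missing items in each row, then group by ID, Month, and Color, and filter to the rows with the fewest missing values: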

library(dplyr)
dat <- data.frame("Month" = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
                  "Day" = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
                  "Year" = rep(2020, 7),
                  "Color" = c(rep("Blue", 3), rep("Red", 4)),
                  "Weather" = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
                  "Location" = c("Hospital", rep(NA, 4), "Garden", "Desk"),
                  "Transporation" = c(rep(NA, 5), "Run", "Bus"),
                  "ID" = c(1, 1, 1, 2, 2, 2, 3)
) %>%
  mutate(Missing = rowSums(is.na(.))) %>%  # Count the missing items in each row
  group_by(ID, Month, Color) %>%
  filter(Missing == min(Missing)) %>%      # Keep the rows with the fewest missing items per group
  ungroup() %>%
  select(-Missing)                         # Drop the helper column; it was only used to filter
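
One thing to be aware of with the filter() approach: if two rows in the same group tie for the fewest NAs, both are kept. The following is a minimal sketch, not part of the original answer, that assumes exactly one row per group is wanted and uses dplyr's slice_min() with with_ties = FALSE. The names dat_raw and dedup are made up for illustration.

```r
library(dplyr)

# Illustrative copy of the question's data (dat_raw is a hypothetical name)
dat_raw <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  Year = rep(2020, 7),
  Color = c(rep("Blue", 3), rep("Red", 4)),
  Weather = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  Location = c("Hospital", rep(NA, 4), "Garden", "Desk"),
  Transporation = c(rep(NA, 5), "Run", "Bus"),
  ID = c(1, 1, 1, 2, 2, 2, 3),
  stringsAsFactors = FALSE
)

dedup <- dat_raw %>%
  mutate(Missing = rowSums(is.na(dat_raw))) %>%     # NA count per row
  group_by(ID, Month, Color) %>%
  slice_min(Missing, n = 1, with_ties = FALSE) %>%  # exactly one row per group
  ungroup() %>%
  select(-Missing)                                  # drop the helper column
```

With with_ties = FALSE, a tie is broken by taking the first of the tied rows in their existing order, so sort the data beforehand if a different tie-break is wanted.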

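We can use order to select the first non-NA element of each column after grouping by the columns of interest. Because this picks the first non-NA value column by column, a result row can combine values from different original rows.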

library(dplyr)
dat %>%
group_by(Month, Day, Year) %>% 
summarise(across(everything(), ~ first(.[order(is.na(.))])), .groups = 'drop')

Output:

# A tibble: 3 x 8
Month Day    Year Color Weather Location Transporation    ID
<chr> <chr> <dbl> <chr> <chr>   <chr>    <chr>         <dbl>
1 Feb   Thu    2020 Red   Warm    Garden   Run               2
2 Jan   Tue    2020 Blue  Warm    Hospital <NA>              1
3 Mar   Thu    2020 Red   Cold    Desk     Bus               3

Data

dat <- structure(list(Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", 
"Mar"), Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"
), Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020), Color = c("Blue", 
"Blue", "Blue", "Red", "Red", "Red", "Red"), Weather = c("Warm", 
"Warm", NA, NA, "Warm", "Warm", "Cold"), Location = c("Hospital", 
NA, NA, NA, NA, "Garden", "Desk"), Transporation = c(NA, NA, 
NA, NA, NA, "Run", "Bus"), ID = c(1, 1, 1, 2, 2, 2, 3)), class = "data.frame", row.names = c(NA, 
-7L))
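
For comparison, the same "keep the most complete row per key" idea can be sketched in base R without any packages. This is an illustration under assumptions, not something taken from the answers: toy is a made-up three-row data frame, and the key columns are the ones named in the question.

```r
# Hypothetical toy data: the first two rows share the key (ID, Month, Color)
toy <- data.frame(
  ID = c(1, 1, 2),
  Month = c("Jan", "Jan", "Feb"),
  Color = c("Blue", "Blue", "Red"),
  Weather = c(NA, "Warm", "Cold"),
  Location = c(NA, "Hospital", NA),
  stringsAsFactors = FALSE
)

n_na <- rowSums(is.na(toy))                   # number of NAs in each row
sorted <- toy[order(n_na), ]                  # most complete rows first
keys <- sorted[, c("ID", "Month", "Color")]   # columns that define a duplicate
result <- sorted[!duplicated(keys), ]         # first row per key = fewest NAs
result <- result[order(result$ID), ]          # restore ordering by ID
```

Sorting first and then dropping later duplicates relies on duplicated() marking every occurrence after the first, so the surviving row is the one with the fewest NAs.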

Using the data.table library, if your data is already in j:

library(data.table)
j <- as.data.table(your_data)
j
Month    Day  Year  Color Weather Location Transporation    ID
<char> <char> <int> <char>  <char>   <char>        <char> <int>
1:    Jan    Tue  2020   Blue    Warm Hospital          <NA>     1
2:    Jan    Tue  2020   Blue    Warm     <NA>          <NA>     1
3:    Jan    Tue  2020   Blue    <NA>     <NA>          <NA>     1
4:    Feb    Thu  2020    Red    <NA>     <NA>          <NA>     2
5:    Feb    Thu  2020    Red    Warm     <NA>          <NA>     2
6:    Feb    Thu  2020    Red    Warm   Garden           Run     2
7:    Mar    Thu  2020    Red    Cold     Desk           Bus     3
j$n_na  <- apply(j, MARGIN = 1, function(x) sum(is.na(x)))
setorder(j,n_na)
k <- unique(j,by=c("ID","Month","Color"))
setorder(k,ID)
k
Month    Day  Year  Color Weather Location Transporation    ID  n_na
<char> <char> <int> <char>  <char>   <char>        <char> <int> <int>
1:    Jan    Tue  2020   Blue    Warm Hospital          <NA>     1     1
2:    Feb    Thu  2020    Red    Warm   Garden           Run     2     0
3:    Mar    Thu  2020    Red    Cold     Desk           Bus     3     0

After all this, k will hold the data as you requested. Regards, Miguel
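
As a possible variation on the steps above, the NA count can be added by reference with := and rowSums() instead of apply(), and the helper column dropped at the end. This is a sketch under assumptions: toy is a made-up table, not the poster's data.

```r
library(data.table)

# Hypothetical toy data: the first two rows share the key (ID, Month, Color)
toy <- data.table(
  ID = c(1L, 1L, 2L),
  Month = c("Jan", "Jan", "Feb"),
  Color = c("Blue", "Blue", "Red"),
  Weather = c(NA, "Warm", "Cold"),
  Location = c(NA, "Hospital", NA)
)

toy[, n_na := rowSums(is.na(.SD))]                # NA count per row, added by reference
setorder(toy, n_na)                               # most complete rows first
k <- unique(toy, by = c("ID", "Month", "Color"))  # keeps the first row per key
setorder(k, ID)
k[, n_na := NULL]                                 # remove the helper column
```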
