Remove duplicates based on multiple columns, but choose which row to keep, using R



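I have a script that removes duplicates based on the values in three columns. The data frame has more than three columns, but I only want to deduplicate on those specific ones.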

DF2021 <- DF2021[!duplicated(DF2021[, c("column1", "column2", "column3")]), ]

The script above works well: whenever those three columns are duplicated, I am left with a single row.
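For reference, `duplicated()` flags every occurrence after the first, so the line above always keeps the first row of each key combination, and `fromLast = TRUE` keeps the last one instead. A minimal sketch on a made-up data frame (`toy`, `keys`, `keep_first`, and `keep_last` are illustrative names, not part of the original script):

```r
# Hypothetical toy data: the first two rows share the same key columns
toy <- data.frame(
  column1 = c("Jan", "Jan", "Feb"),
  column2 = c("Tue", "Tue", "Thu"),
  column3 = c(2020L, 2020L, 2020L),
  column4 = c(NA, "Blue", "Red"),
  stringsAsFactors = FALSE
)
keys <- c("column1", "column2", "column3")

# Default: keep the first row of each key combination
keep_first <- toy[!duplicated(toy[, keys]), ]

# fromLast = TRUE: keep the last row of each key combination
keep_last <- toy[!duplicated(toy[, keys], fromLast = TRUE), ]
```

Neither variant looks at the NAs, so which row survives depends only on the row order.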

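As a next step, I would like to know how to make sure the row that is kept is chosen by a criterion. For example, I want the row with the fewest NAs.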

column1|column2|column3|column4|column5|column6|column7
Jan     Tue    2020   Blue    Warm  Hospital    NA
Jan     Tue    2020   Blue    Warm     NA       NA
Jan     Tue    2020   Blue    NA       NA       NA
Feb     Thu    2020   Red     NA       NA       NA
Feb     Thu    2020   Red     Warm     NA       NA
Feb     Thu    2020   Red     Warm   Garden    Run
Mar     Thu    2020   Red     Cold   Desk      Bus

In the end, I want the deduplication to leave me with three rows:

column1|column2|column3|column4|column5|column6|column7
Jan      Tue   2020    Blue    Warm   Hospital   NA
Feb      Thu   2020     Red    Warm   Garden    Run
Mar      Thu   2020     Red    Cold   Desk      Bus

Note that the rows I keep are the most complete ones, i.e. the ones with the fewest NAs.

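UPDATE: What I mean is that I want the most complete row, i.e. the row with the most columns filled in. As first written, the question made it look as if I was only after rows in which every column is filled in. I should have included a row where column 7 is empty: I still want that row pulled from the data set, whether or not all of its columns are filled in.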

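Try this base R approach, which keeps only the rows that contain no NA at all. Apply it to the original data rather than after the `duplicated()` step, because by then the complete row of a group may already have been dropped: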

#Code
DF2021 <- DF2021[complete.cases(DF2021),]

Output:

column1 column2 column3 column4 column5  column6
1     Jan     Tue    2020    Blue    Warm Hospital
6     Feb     Thu    2020     Red    Warm   Garden
7     Mar     Thu    2020     Red    Cold     Desk

Data used:

#Data
DF2021 <- structure(list(column1 = c("Jan", "Jan", "Jan", "Feb", "Feb", 
"Feb", "Mar"), column2 = c("Tue", "Tue", "Tue", "Thu", "Thu", 
"Thu", "Thu"), column3 = c(2020L, 2020L, 2020L, 2020L, 2020L, 
2020L, 2020L), column4 = c("Blue", "Blue", "Blue", "Red", "Red", 
"Red", "Red"), column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm", 
"Cold"), column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"
)), row.names = c(NA, -7L), class = "data.frame")
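Note that `complete.cases()` drops every row containing any NA, so once `column7` is part of the data all Jan rows would disappear. A base R sketch that instead keeps the row with the fewest NAs per key combination, assuming the question's seven-column sample (the names `keys`, `na_count`, `sorted`, and `result` are illustrative):

```r
# Sample rows from the question, including column7
DF2021 <- data.frame(
  column1 = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  column2 = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  column3 = rep(2020L, 7),
  column4 = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Red"),
  column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"),
  column7 = c(NA, NA, NA, NA, NA, "Run", "Bus"),
  stringsAsFactors = FALSE
)
keys <- c("column1", "column2", "column3")

# Number of NAs in each row
na_count <- rowSums(is.na(DF2021))

# Sort so the most complete rows come first (order() leaves ties in their
# original order), then keep the first row of every key combination
sorted <- DF2021[order(na_count), ]
result <- sorted[!duplicated(sorted[, keys]), ]
```

Sorting first and deduplicating second means the existing `duplicated()` line can be reused unchanged; the result comes back ordered by completeness rather than in the original row order.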

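We can group by the first three columns and then use `slice` to keep, within each group, the row with the fewest NAs in the remaining columns: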

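library(dplyr)

df1 %>%
  # one group per combination of the three key columns
  group_by(column1, column2, column3) %>%
  # keep the row with the fewest NAs in column4:column7 (which.min() returns the first minimum)
  # cur_data() is deprecated since dplyr 1.1.0; pick(column4:column7) replaces it there
  slice(which.min(rowSums(is.na(select(cur_data(), column4:column7))))) %>%
  ungroup()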

Output:

# A tibble: 3 x 7
#  column1 column2 column3 column4 column5 column6  column7
#  <chr>   <chr>     <int> <chr>   <chr>   <chr>    <chr>  
#1 Feb     Thu        2020 Red     Warm    Garden   Run    
#2 Jan     Tue        2020 Blue    Warm    Hospital <NA>   
#3 Mar     Thu        2020 Red     Cold    Desk     Bus    

Data:

df1 <- structure(list(column1 = c("Jan", "Jan", "Jan", "Feb", "Feb", 
"Feb", "Mar"), column2 = c("Tue", "Tue", "Tue", "Thu", "Thu", 
"Thu", "Thu"), column3 = c(2020L, 2020L, 2020L, 2020L, 2020L, 
2020L, 2020L), column4 = c("Blue", "Blue", "Blue", "Red", "Red", 
"Red", "Red"), column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm", 
"Cold"), column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"
), column7 = c(NA, NA, NA, NA, NA, "Run", "Bus")), 
class = "data.frame", row.names = c(NA, 
-7L))
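The same idea can also be written with `slice_min()`, which avoids the nested `select()` call. This is only a sketch, assuming dplyr 1.0.0 or later; the helper column `n_na` and the name `result` are introduced here for illustration:

```r
library(dplyr)

# Sample rows from the question
df1 <- data.frame(
  column1 = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  column2 = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  column3 = rep(2020L, 7),
  column4 = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Red"),
  column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"),
  column7 = c(NA, NA, NA, NA, NA, "Run", "Bus"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  # NA count over the non-key columns, spelled out column by column
  mutate(n_na = is.na(column4) + is.na(column5) + is.na(column6) + is.na(column7)) %>%
  group_by(column1, column2, column3) %>%
  # with_ties = FALSE keeps exactly one row per group even if several rows tie
  slice_min(n_na, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # drop the helper column again
  select(-n_na)
```

Keeping the NA count in a named column makes the selection rule easy to inspect before the helper column is dropped.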
