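I have a script that removes duplicate rows based on the values in three columns. The data frame has more than three columns, but I only want to deduplicate on those specific ones: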
DF2021 <- DF2021[!duplicated(DF2021[, c("column1", "column2", "column3")]), ]
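The script above works well: whenever those three columns repeat, I am left with a single row.
As a next step, I would like to control which row is kept, based on a criterion. For example, I want to keep the row with the fewest NAs.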
column1|column2|column3|column4|column5|column6|column7
Jan Tue 2020 Blue Warm Hospital NA
Jan Tue 2020 Blue Warm NA NA
Jan Tue 2020 Blue NA NA NA
Feb Thu 2020 Red NA NA NA
Feb Thu 2020 Red Warm NA NA
Feb Thu 2020 Red Warm Garden Run
Mar Thu 2020 Red Cold Desk Bus
In the end, after removing the duplicates, I want to be left with three rows:
column1|column2|column3|column4|column5|column6|column7
Jan Tue 2020 Blue Warm Hospital NA
Feb Thu 2020 Red Warm Garden Run
Mar Thu 2020 Red Cold Desk Bus
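Note that the rows I keep are the most complete ones, i.e. the ones with the fewest NAs.
UPDATE: What I meant was to keep the most complete row, the one with the most columns filled in. As first written, the question made it look as if I only wanted rows in which every column is complete. I should have included a row where column7 is not filled in; I still want that row to be kept even though not all of its columns are complete.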
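Try this base R approach. Run it before the `duplicated()` step rather than after it: `duplicated()` keeps the first row of each group, which is not necessarily the complete one, and the output below (rows 1, 6 and 7) is what you get from the original seven rows.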
#Code
DF2021 <- DF2021[complete.cases(DF2021),]
Output:
column1 column2 column3 column4 column5 column6
1 Jan Tue 2020 Blue Warm Hospital
6 Feb Thu 2020 Red Warm Garden
7 Mar Thu 2020 Red Cold Desk
Some data used:
#Data
DF2021 <- structure(list(column1 = c("Jan", "Jan", "Jan", "Feb", "Feb",
"Feb", "Mar"), column2 = c("Tue", "Tue", "Tue", "Thu", "Thu",
"Thu", "Thu"), column3 = c(2020L, 2020L, 2020L, 2020L, 2020L,
2020L, 2020L), column4 = c("Blue", "Blue", "Blue", "Red", "Red",
"Red", "Red"), column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm",
"Cold"), column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"
)), row.names = c(NA, -7L), class = "data.frame")
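`complete.cases()` only helps when every group contains a fully complete row. For the updated requirement (keep the row with the fewest NAs, even if it still has some), a minimal base R sketch is to rank rows by their NA count and let `duplicated()` pick the first row per key in that ranking. The data below is the example from the question with column7 included; the names `keys`, `n_na`, `idx`, `keep` and `result` are only illustrative.

```r
# Example data from the question, including column7.
DF2021 <- data.frame(
  column1 = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  column2 = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  column3 = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L),
  column4 = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Red"),
  column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"),
  column7 = c(NA, NA, NA, NA, NA, "Run", "Bus"),
  stringsAsFactors = FALSE
)

# Columns that define a duplicate.
keys <- c("column1", "column2", "column3")

# Number of NAs in each row.
n_na <- rowSums(is.na(DF2021))

# Row indices ordered from fewest to most NAs; ties keep their original order.
idx <- order(n_na)

# In that ordering, the first row of each key combination is the one with
# the fewest NAs, so drop every later occurrence.
keep <- idx[!duplicated(DF2021[idx, keys])]

# Restore the original row order.
result <- DF2021[sort(keep), ]
print(result)
```

On this data the sketch keeps rows 1, 6 and 7, so the Jan row is retained although its column7 is still NA. If two rows in a group have the same NA count, the one that appears first in the data wins.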
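We could group by the first three columns and then use `slice` to keep, within each group, the row with the fewest NAs in the remaining columns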
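library(dplyr)
df1 %>%
  group_by(column1, column2, column3) %>%
  # per group, keep the row with the fewest NAs in column4..column7
  # (cur_data() is deprecated since dplyr 1.1.0; pick(column4:column7) replaces it there)
  slice(which.min(rowSums(is.na(select(cur_data(), column4:column7))))) %>%
  ungroup()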
-output
# A tibble: 3 x 7
# column1 column2 column3 column4 column5 column6 column7
# <chr> <chr> <int> <chr> <chr> <chr> <chr>
#1 Feb Thu 2020 Red Warm Garden Run
#2 Jan Tue 2020 Blue Warm Hospital <NA>
#3 Mar Thu 2020 Red Cold Desk Bus
data
df1 <- structure(list(column1 = c("Jan", "Jan", "Jan", "Feb", "Feb",
"Feb", "Mar"), column2 = c("Tue", "Tue", "Tue", "Thu", "Thu",
"Thu", "Thu"), column3 = c(2020L, 2020L, 2020L, 2020L, 2020L,
2020L, 2020L), column4 = c("Blue", "Blue", "Blue", "Red", "Red",
"Red", "Red"), column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm",
"Cold"), column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"
), column7 = c(NA, NA, NA, NA, NA, "Run", "Bus")),
class = "data.frame", row.names = c(NA,
-7L))
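As a variation on the grouped `slice` above, the NA count can be stored in a temporary column and the selection done with `slice_min()`. This is only a sketch: it assumes dplyr 1.1.0 or later (for `pick()`), and `n_na` and `result` are illustrative names that are not part of the original answer.

```r
library(dplyr)

# Same example data as df1 above.
df1 <- data.frame(
  column1 = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  column2 = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  column3 = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 2020L),
  column4 = c("Blue", "Blue", "Blue", "Red", "Red", "Red", "Red"),
  column5 = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  column6 = c("Hospital", NA, NA, NA, NA, "Garden", "Desk"),
  column7 = c(NA, NA, NA, NA, NA, "Run", "Bus"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  # temporary helper column: number of NAs in column4..column7 per row
  mutate(n_na = rowSums(is.na(pick(column4:column7)))) %>%
  group_by(column1, column2, column3) %>%
  # keep exactly one row per group, the one with the smallest NA count
  slice_min(n_na, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # drop the helper column again
  select(-n_na)

print(result)
```

Setting `with_ties = FALSE` guarantees a single row per group even when several rows share the lowest NA count; with the default `with_ties = TRUE` all tied rows would be returned.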