I have data that looks like this:
Month| Day| Year| Color| Weather|Location|Transporation|ID
Jan Tue 2020 Blue Warm Hospital NA 1
Jan Tue 2020 Blue Warm NA NA 1
Jan Tue 2020 Blue NA NA NA 1
Feb Thu 2020 Red NA NA NA 2
Feb Thu 2020 Red Warm NA NA 2
Feb Thu 2020 Red Warm Garden Run 2
Mar Thu 2020 Red Cold Desk Bus 3
I would like it to look like this:
Month| Day| Year| Color| Weather|Location| Transporation|ID
Jan Tue 2020 Blue Warm Hospital NA 1
Feb Thu 2020 Red Warm Garden Run 2
Mar Thu 2020 Red Cold Desk Bus 3
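Basically, I want to use three columns, c(ID, Month, Color), to decide whether a row is a duplicate. Once the duplicates are identified, I want to drop the ones with the most NAs and keep only the most complete row.
Maybe this will work. I use rowSums(is.na(.)) to count the missing items in each row, then group by ID, Month, Color and filter to the rows with the fewest missing values: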
library(dplyr)
dat <- data.frame(
  "Month" = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  "Day" = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  "Year" = rep(2020, 7),
  "Color" = c(rep("Blue", 3), rep("Red", 4)),
  "Weather" = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  "Location" = c("Hospital", rep(NA, 4), "Garden", "Desk"),
  "Transporation" = c(rep(NA, 5), "Run", "Bus"),
  "ID" = c(1, 1, 1, 2, 2, 2, 3)
) %>%
  mutate(Missing = rowSums(is.na(.))) %>% # Count how many items are missing in each row
  group_by(ID, Month, Color) %>%
  filter(Missing == min(Missing)) %>% # Keep the rows with the fewest missing items per group
  ungroup() %>%
  select(-Missing) # Drop the helper column, as it was only used to filter
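One thing to watch with the filter above: if two rows in a group tie for the fewest NAs, both are kept. A minimal sketch of one way to guarantee a single row per group, assuming dplyr 1.0.0 or later for slice_min(); the names dat_raw and result are only for illustration:

```r
library(dplyr)

# Same example data as in the question
dat_raw <- data.frame(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  Year = rep(2020, 7),
  Color = c(rep("Blue", 3), rep("Red", 4)),
  Weather = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  Location = c("Hospital", rep(NA, 4), "Garden", "Desk"),
  Transporation = c(rep(NA, 5), "Run", "Bus"),
  ID = c(1, 1, 1, 2, 2, 2, 3),
  stringsAsFactors = FALSE
)

# Count the NAs in each row before grouping
dat_raw$Missing <- rowSums(is.na(dat_raw))

result <- dat_raw %>%
  group_by(ID, Month, Color) %>%
  # with_ties = FALSE keeps exactly one row per group, even when NA counts tie
  slice_min(Missing, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(-Missing)
```

With ties, slice_min() keeps whichever tied row appears first in the data, so sort beforehand if you need a specific tie-breaker.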
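We can use order to select the first non-NA element of each column after grouping by the columns of interest. Note that this picks values column by column, so the result can combine values from different rows in a group, and that the example groups by Month, Day, Year rather than ID, Month, Color: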
library(dplyr)
dat %>%
group_by(Month, Day, Year) %>%
summarise(across(everything(), ~ first(.[order(is.na(.))])), .groups = 'drop')
Output:
# A tibble: 3 x 8
Month Day Year Color Weather Location Transporation ID
<chr> <chr> <dbl> <chr> <chr> <chr> <chr> <dbl>
1 Feb Thu 2020 Red Warm Garden Run 2
2 Jan Tue 2020 Blue Warm Hospital <NA> 1
3 Mar Thu 2020 Red Cold Desk Bus 3
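To show how the column-wise approach differs from keeping the most complete row, here is a small sketch on made-up data. The toy data frame and its columns g, a, b are hypothetical and not part of the question:

```r
library(dplyr)

# Hypothetical two-row group where each row is missing a different column
toy <- data.frame(
  g = c("x", "x"),
  a = c(1, NA),
  b = c(NA, 2),
  stringsAsFactors = FALSE
)

merged <- toy %>%
  group_by(g) %>%
  # For every column, sort non-NA values first and take the first one
  summarise(across(everything(), ~ first(.[order(is.na(.))])), .groups = "drop")

# merged has one row with a = 1 and b = 2, a combination that
# does not appear as a single row in toy
```

If that merging is what you want, this approach is convenient. If you need an original row returned unchanged, a row-wise NA count is the safer choice.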
Data
dat <- structure(list(Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb",
"Mar"), Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"
), Year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020), Color = c("Blue",
"Blue", "Blue", "Red", "Red", "Red", "Red"), Weather = c("Warm",
"Warm", NA, NA, "Warm", "Warm", "Cold"), Location = c("Hospital",
NA, NA, NA, NA, "Garden", "Desk"), Transporation = c(NA, NA,
NA, NA, NA, "Run", "Bus"), ID = c(1, 1, 1, 2, 2, 2, 3)), class = "data.frame", row.names = c(NA,
-7L))
Using the data.table library, with your data converted to a data.table named j:
library(data.table)
j <- as.data.table(your_data)
j
Month Day Year Color Weather Location Transporation ID
<char> <char> <int> <char> <char> <char> <char> <int>
1: Jan Tue 2020 Blue Warm Hospital <NA> 1
2: Jan Tue 2020 Blue Warm <NA> <NA> 1
3: Jan Tue 2020 Blue <NA> <NA> <NA> 1
4: Feb Thu 2020 Red <NA> <NA> <NA> 2
5: Feb Thu 2020 Red Warm <NA> <NA> 2
6: Feb Thu 2020 Red Warm Garden Run 2
7: Mar Thu 2020 Red Cold Desk Bus 3
j$n_na <- apply(j, MARGIN = 1, function(x) sum(is.na(x)))
setorder(j,n_na)
k <- unique(j,by=c("ID","Month","Color"))
setorder(k,ID)
k
Month Day Year Color Weather Location Transporation ID n_na
<char> <char> <int> <char> <char> <char> <char> <int> <int>
1: Jan Tue 2020 Blue Warm Hospital <NA> 1 1
2: Feb Thu 2020 Red Warm Garden Run 2 0
3: Mar Thu 2020 Red Cold Desk Bus 3 0
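The same steps can be sketched a little more compactly with data.table's own syntax. This is a variation under stated assumptions rather than the method above: it counts NAs with rowSums(is.na(.SD)) instead of apply(), leaves j's row order untouched, and drops the helper column at the end. The name k2 is only for illustration:

```r
library(data.table)

# Same example data as in the question, built directly as a data.table
j <- data.table(
  Month = c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar"),
  Day = c("Tue", "Tue", "Tue", "Thu", "Thu", "Thu", "Thu"),
  Year = rep(2020L, 7),
  Color = c(rep("Blue", 3), rep("Red", 4)),
  Weather = c("Warm", "Warm", NA, NA, "Warm", "Warm", "Cold"),
  Location = c("Hospital", rep(NA, 4), "Garden", "Desk"),
  Transporation = c(rep(NA, 5), "Run", "Bus"),
  ID = c(1L, 1L, 1L, 2L, 2L, 2L, 3L)
)

# Count the NAs per row; .SD here refers to all columns of j
j[, n_na := rowSums(is.na(.SD))]

# Sort a copy by NA count, then keep the first row of each (ID, Month, Color)
k2 <- unique(j[order(n_na)], by = c("ID", "Month", "Color"))
setorder(k2, ID)

# Remove the helper column from the result
k2[, n_na := NULL]
```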
In the end, k will hold the data as you requested. Regards, Miguel