我正在使用R,需要一些帮助。我有一个巨大的数据集,id被分配给多个类别。我想删除行,这样id只有一个类别。这里的逻辑是:如果id只分配给列"中的一个类别;分配的";不删除任何行(参见id 2(,但是当id被分配给列"中的多个类别时;分配的";移除其具有"0"的行;香蕉;除非有更多的行带有";香蕉;而不是用";苹果;(参见id 3(。如果我能在一个单独的数据帧中获得我删除的记录,那将是非常棒的。
这是我的
df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4,4),
Date=c("04-01-2020", "05-01-2020", "06-01-2020", "04-01-2020",
"05-01-2020", "04-01-2020", "05-01-2020", "06-01-2020",
"05-01-2020", "06-01-20202"),
Assigned = c("Apple", "Banana","Apple","Apple", "Apple",
"Apple","Banana","Banana", "Apple", "Banana"))
这是我想要的
df1 <- data.frame(ID=c(1,1,2,2,3,3,4),
Date=c("04-01-2020", "06-01-2020", "04-01-2020",
"05-01-2020", "05-01-2020", "06-01-2020",
"05-01-2020"),
Assigned = c("Apple", "Apple","Apple", "Apple",
"Banana","Banana", "Apple"))
此外,如何轻松获取ID具有多个类别的行(获取ID 2以外的所有行(。提前谢谢。
我们可以使用Mode
函数并通过filter
进行分组
library(dplyr)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df %>%
group_by(ID) %>%
filter(Assigned == Mode(Assigned))
# A tibble: 7 x 3
# Groups: ID [4]
# ID Date Assigned
# <dbl> <chr> <chr>
#1 1 04-01-2020 Apple
#2 1 06-01-2020 Apple
#3 2 04-01-2020 Apple
#4 2 05-01-2020 Apple
#5 3 05-01-2020 Banana
#6 3 06-01-2020 Banana
#7 4 05-01-2020 Apple
或以更紧凑的方式
library(tibble)
df %>%
# // creates a frequency column based on ID, Assigned
add_count(ID, Assigned) %>%
# // arrange by ID, Assigned
arrange(ID, Assigned) %>%
# // grouped by ID
group_by(ID) %>%
# // create a logical vector where the first 'Assigned' value
# // with the max frequency
filter(Assigned == Assigned[which.max(n)]) %>%
select(-n)
# A tibble: 7 x 3
# Groups: ID [4]
# ID Date Assigned
# <dbl> <chr> <chr>
#1 1 04-01-2020 Apple
#2 1 06-01-2020 Apple
#3 2 04-01-2020 Apple
#4 2 05-01-2020 Apple
#5 3 05-01-2020 Banana
#6 3 06-01-2020 Banana
#7 4 05-01-2020 Apple
您可以在这里使用split-apply-bind
技术来获得守门员
do.call(rbind, lapply(split(df, df$ID), function(x) {
x[x$Assigned == names(which.max(table(x$Assigned))),]
}))
#> ID Date Assigned
#> 1.1 1 04-01-2020 Apple
#> 1.3 1 06-01-2020 Apple
#> 2.4 2 04-01-2020 Apple
#> 2.5 2 05-01-2020 Apple
#> 3.7 3 05-01-2020 Banana
#> 3.8 3 06-01-2020 Banana
#> 4 4 05-01-2020 Apple
以及拒绝:
do.call(rbind, lapply(split(df, df$ID), function(x) {
x[x$Assigned != names(which.max(table(x$Assigned))),]
}))
#> ID Date Assigned
#> 1 1 05-01-2020 Banana
#> 3 3 04-01-2020 Apple
#> 4 4 06-01-20202 Banana
以及每个ID具有一个以上类别的行的索引:
as.integer(which(do.call(c, lapply(split(df, df$ID), function(x) {
rep(length(table(x$Assigned)) > 1, nrow(x))
}))))
#> [1] 1 2 3 6 7 8 9 10