id被分配给多个类别如何删除行,因此id在r中只有一个类别

  • 本文关键字:id 因此 有一个 分配 删除行 r
  • 更新时间 :
  • 英文 :


我正在使用R,需要一些帮助。我有一个巨大的数据集,id被分配给多个类别。我想删除行,这样id只有一个类别。这里的逻辑是:如果id只分配给列"中的一个类别;分配的";不删除任何行(参见id 2(,但是当id被分配给列"中的多个类别时;分配的";移除其具有"0"的行;香蕉;除非有更多的行带有";香蕉;而不是用";苹果;(参见id 3(。如果我能在一个单独的数据帧中获得我删除的记录,那将是非常棒的。

这是我的

df <- data.frame(ID=c(1,1,1,2,2,3,3,3,4,4), 
Date=c("04-01-2020", "05-01-2020", "06-01-2020", "04-01-2020",
"05-01-2020", "04-01-2020", "05-01-2020", "06-01-2020",
"05-01-2020", "06-01-20202"),
Assigned = c("Apple", "Banana","Apple","Apple", "Apple",
"Apple","Banana","Banana", "Apple", "Banana"))

这是我想要的

df1 <-  data.frame(ID=c(1,1,2,2,3,3,4), 
Date=c("04-01-2020", "06-01-2020", "04-01-2020",
"05-01-2020", "05-01-2020", "06-01-2020",
"05-01-2020"),
Assigned = c("Apple", "Apple","Apple", "Apple",
"Banana","Banana", "Apple"))

此外,如何轻松获取ID具有多个类别的行(获取ID 2以外的所有行(。提前谢谢。

我们可以使用Mode函数并通过filter进行分组

library(dplyr)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df %>% 
group_by(ID) %>% 
filter(Assigned == Mode(Assigned))
# A tibble: 7 x 3
# Groups:   ID [4]
#     ID Date       Assigned
#  <dbl> <chr>      <chr>   
#1     1 04-01-2020 Apple   
#2     1 06-01-2020 Apple   
#3     2 04-01-2020 Apple   
#4     2 05-01-2020 Apple   
#5     3 05-01-2020 Banana  
#6     3 06-01-2020 Banana  
#7     4 05-01-2020 Apple

或以更紧凑的方式

library(tibble)
df %>%
# // creates a frequency column based on ID, Assigned
add_count(ID, Assigned) %>% 
# // arrange by ID, Assigned
arrange(ID, Assigned) %>%
# // grouped by ID
group_by(ID) %>% 
# // create a logical vector where the first 'Assigned' value 
# // with the max frequency
filter(Assigned == Assigned[which.max(n)])   %>%
select(-n) 
# A tibble: 7 x 3
# Groups:   ID [4]
#     ID Date       Assigned
#  <dbl> <chr>      <chr>   
#1     1 04-01-2020 Apple   
#2     1 06-01-2020 Apple   
#3     2 04-01-2020 Apple   
#4     2 05-01-2020 Apple   
#5     3 05-01-2020 Banana  
#6     3 06-01-2020 Banana  
#7     4 05-01-2020 Apple   

您可以在这里使用split-apply-bind技术来获得守门员

do.call(rbind, lapply(split(df, df$ID), function(x) {
x[x$Assigned == names(which.max(table(x$Assigned))),]
}))
#>     ID       Date Assigned
#> 1.1  1 04-01-2020    Apple
#> 1.3  1 06-01-2020    Apple
#> 2.4  2 04-01-2020    Apple
#> 2.5  2 05-01-2020    Apple
#> 3.7  3 05-01-2020   Banana
#> 3.8  3 06-01-2020   Banana
#> 4    4 05-01-2020    Apple

以及拒绝:

do.call(rbind, lapply(split(df, df$ID), function(x) {
x[x$Assigned != names(which.max(table(x$Assigned))),]
}))
#>   ID        Date Assigned
#> 1  1  05-01-2020   Banana
#> 3  3  04-01-2020    Apple
#> 4  4 06-01-20202   Banana

以及每个ID具有一个以上类别的行的索引:

as.integer(which(do.call(c, lapply(split(df, df$ID), function(x) {
rep(length(table(x$Assigned)) > 1, nrow(x))
}))))
#> [1]  1  2  3  6  7  8  9 10

相关内容

最新更新