r - Remove duplicate rows from a df that contains tibbles; explicitly without dplyr



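I would like to remove all duplicate rows from a data frame that contains tibbles, without using the dplyr package. Since I am a beginner, I would appreciate your help.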

library(tibble)
library(data.table)
library(dplyr)
#create tibbles
tibble_inst <- tibble(id = c("ABC1234", "DEF123", "GHI12"),
name = c("abc inst", "def inst", "ghi inst")
)
tibble_aag1 <- tibble(id = c("AA1111", "AA2222"),
name = c("AA", "AB")
)
tibble_aag2 <- tibble(id = c("ABC1234", "DEF123", "GHI12"),
name = c("abc inst", "def inst", "ghi inst")
)
# create col with tibble
matched = list(tibble_aag1, NULL, tibble_inst, tibble_inst, tibble_inst, 
tibble_aag1, tibble_aag2, NULL, NULL, tibble_inst)
# create df
dt <- data.table(
word = c("A AG", "WIL", "Inst", "Inst", "Inst", 
"A AG", "A AG", "Inst", "WIL", "Inst"),
entity = c("ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG"),
page_num = c(1,2,4,4,4,1,4,2,0,1))
dt$matched <- matched
dt

A solution using dplyr looks like this:

dt1 <- dt %>% distinct(word, page_num, matched)
dt1

This does not work, and I don't know how to deal with the error message:

cols <- c("word", "page_num", "matched")
dt2 <- dt[!duplicated(dt[cols]), cols, drop = FALSE]
Error: When i is a data.table (or character vector), the columns 
to join by must be specified using 'on=' argument (see ?data.table), by
keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by 
sharing column names between x and i (i.e., a natural join). Keyed 
joins might have further speed benefits on very large data due to x 
being sorted in RAM.

My expected result should look like this:

# A tibble: 7 × 3
word  page_num matched         
<chr>    <dbl> <list>          
1 A AG         1 <tibble [2 × 2]>
2 WIL          2 <NULL>          
3 Inst         4 <tibble [3 × 2]>
4 A AG         4 <tibble [3 × 2]>
5 Inst         2 <NULL>          
6 WIL          0 <NULL>          
7 Inst         1 <tibble [3 × 2]>

Using duplicated, you could do:

cols <- c("word", "page_num", "matched")
dt2 <- dt[!duplicated(dt[cols]), cols, drop = FALSE]
dt2
#> # A tibble: 7 × 3
#>   word  page_num matched         
#>   <chr>    <dbl> <list>          
#> 1 A AG         1 <tibble [2 × 2]>
#> 2 WIL          2 <NULL>          
#> 3 Inst         4 <tibble [3 × 2]>
#> 4 A AG         4 <tibble [3 × 2]>
#> 5 Inst         2 <NULL>          
#> 6 WIL          0 <NULL>          
#> 7 Inst         1 <tibble [3 × 2]>
library(dplyr)
dt1 <- dt %>% distinct(word, page_num, matched)
identical(dt1, dt2)
#> [1] TRUE

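First, in my opinion, changing a question in a way that renders the answers to the original question useless is not good practice. In such a case I would suggest posting a new question.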

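That said, the issue is that your dt object is now a data.table. This means dt[cols] no longer works, because data.table treats a character vector in the first position as row lookup (a join) rather than column selection. Instead we have to use, for example, dt[, ..cols] to select columns named in a character vector. However, even after fixing that, we get an error: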

Column 3 passed to [f]order is type 'list', not yet supported.

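I also tried unique(dt[, ..cols]) and dtplyr, but always ran into the same issue. I am not an expert in data.table, but from the error message my guess is that data.table's duplicated method does not work with list columns. I may be wrong, though.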
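As a rough illustration of the column selection discussed above, here is a minimal sketch on a hypothetical toy table (dt_demo and its values are made up for this example, not the data from the question). It assumes only atomic columns, which is the case where data.table's own duplicated method is expected to work:

```r
library(data.table)

# Hypothetical toy table with atomic columns only (no list column)
dt_demo <- data.table(
  word = c("A AG", "WIL", "A AG"),
  page_num = c(1, 2, 1),
  entity = "ORG"
)
cols <- c("word", "page_num")

# Three ways to select columns whose names are stored in a character vector
sel_prefix <- dt_demo[, ..cols]
sel_with   <- dt_demo[, cols, with = FALSE]
sel_sdcols <- dt_demo[, .SD, .SDcols = cols]

# Without a list column, data.table's duplicated() method can be used directly
dedup <- dt_demo[!duplicated(dt_demo[, ..cols]), ..cols]
print(dedup)
```

All three selections should return the same two columns. The deduplication step only fails once a list column such as matched is part of cols.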

However, one fix is to use the default duplicated.data.frame method:

library(data.table)
cols <- c("word", "page_num", "matched")
dt[!duplicated.data.frame(dt[, ..cols]), ..cols, drop = FALSE]
#>    word page_num       matched
#> 1: A AG        1 <tbl_df[2x2]>
#> 2:  WIL        2              
#> 3: Inst        4 <tbl_df[3x2]>
#> 4: A AG        4 <tbl_df[3x2]>
#> 5: Inst        2              
#> 6:  WIL        0              
#> 7: Inst        1 <tbl_df[3x2]>
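A possible variation on the same idea, sketched here on hypothetical toy data (dt_demo is made up for illustration), is to convert the selected columns with as.data.frame so that the generic duplicated() dispatches to the base data.frame method instead of calling that method by name. This is an assumption about style rather than part of the original answer, and it makes an extra copy of the selected columns:

```r
library(data.table)

# Hypothetical toy data: rows 1 and 3 match on every column,
# including the content of the list column
dt_demo <- data.table(
  word = c("A AG", "WIL", "A AG", "A AG"),
  page_num = c(1, 2, 1, 4)
)
dt_demo$matched <- list(
  data.frame(id = "AA1111"),
  NULL,
  data.frame(id = "AA1111"),
  data.frame(id = "AA1111")
)
cols <- c("word", "page_num", "matched")

# A plain data.frame makes duplicated() use the base method,
# which compares the cells of a list column by value
is_dup <- duplicated(as.data.frame(dt_demo[, ..cols]))

# Keep the first occurrence of each row
dt_unique <- dt_demo[!is_dup, ..cols]
print(dt_unique)
```

Because the comparison goes row by row through base R, it may be noticeably slower than data.table's native method on large tables.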
