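I want to remove all duplicated rows from a data frame that contains tibbles (in a list column), without using the dplyr package. I am a beginner, so I would appreciate your help.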
library(tibble)
library(data.table)
library(dplyr)
# create tibbles
tibble_inst <- tibble(id = c("ABC1234", "DEF123", "GHI12"),
                      name = c("abc inst", "def inst", "ghi inst")
)
tibble_aag1 <- tibble(id = c("AA1111", "AA2222"),
                      name = c("AA", "AB")
)
tibble_aag2 <- tibble(id = c("ABC1234", "DEF123", "GHI12"),
                      name = c("abc inst", "def inst", "ghi inst")
)
# create list column holding tibbles
matched <- list(tibble_aag1, NULL, tibble_inst, tibble_inst, tibble_inst,
                tibble_aag1, tibble_aag2, NULL, NULL, tibble_inst)
# create the data.table
dt <- data.table(
  word = c("A AG", "WIL", "Inst", "Inst", "Inst",
           "A AG", "A AG", "Inst", "WIL", "Inst"),
  entity = c("ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG", "ORG"),
  page_num = c(1, 2, 4, 4, 4, 1, 4, 2, 0, 1))
dt$matched <- matched
dt
A solution using dplyr looks like this:
dt1 <- dt %>% distinct(word, page_num, matched)
dt1
The following does not work, and I do not know what to make of the error message:
cols <- c("word", "page_num", "matched")
dt2 <- dt[!duplicated(dt[cols]), cols, drop = FALSE]
Error: When i is a data.table (or character vector), the columns
to join by must be specified using 'on=' argument (see ?data.table), by
keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by
sharing column names between x and i (i.e., a natural join). Keyed
joins might have further speed benefits on very large data due to x
being sorted in RAM.
The result I expect should look like this:
# A tibble: 7 × 3
  word  page_num matched
  <chr>    <dbl> <list>
1 A AG         1 <tibble [2 × 2]>
2 WIL          2 <NULL>
3 Inst         4 <tibble [3 × 2]>
4 A AG         4 <tibble [3 × 2]>
5 Inst         2 <NULL>
6 WIL          0 <NULL>
7 Inst         1 <tibble [3 × 2]>
Using duplicated, you could do:
cols <- c("word", "page_num", "matched")
dt2 <- dt[!duplicated(dt[cols]), cols, drop = FALSE]
dt2
#> # A tibble: 7 × 3
#>   word  page_num matched
#>   <chr>    <dbl> <list>
#> 1 A AG         1 <tibble [2 × 2]>
#> 2 WIL          2 <NULL>
#> 3 Inst         4 <tibble [3 × 2]>
#> 4 A AG         4 <tibble [3 × 2]>
#> 5 Inst         2 <NULL>
#> 6 WIL          0 <NULL>
#> 7 Inst         1 <tibble [3 × 2]>
library(dplyr)
dt1 <- dt %>% distinct(word, page_num, matched)
identical(dt1, dt2)
#> [1] TRUE
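First: in my opinion, changing a question in a way that makes answers to the original question useless is not good practice. In a case like this, I would suggest posting a new question.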
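That said, the problem is that your dt object is now a data.table. This means that dt[cols] no longer works: data.table treats a character vector in the first position as a join on rows rather than as a column selection, which is what the error message is about. Instead, we have to use e.g. dt[, ..cols] to select columns by a character vector. However, even after fixing that, we get an error:
Column 3 passed to [f]order is type 'list', not yet supported.
I also tried unique(dt[, ..cols]) and dtplyr, but always ran into the same problem. I am not an expert in data.table, but from the error message I would guess that data.table's duplicated method does not work with list columns. I might be wrong, though.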
One fix, however, is to call the default duplicated.data.frame method explicitly:
library(data.table)
cols <- c("word", "page_num", "matched")
dt[!duplicated.data.frame(dt[, ..cols]), ..cols, drop = FALSE]
#>    word page_num       matched
#> 1: A AG        1 <tbl_df[2x2]>
#> 2:  WIL        2
#> 3: Inst        4 <tbl_df[3x2]>
#> 4: A AG        4 <tbl_df[3x2]>
#> 5: Inst        2
#> 6:  WIL        0
#> 7: Inst        1 <tbl_df[3x2]>
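To illustrate why falling back to the data-frame method helps, here is a minimal base R sketch that is independent of the data above. The toy data frame df and its values are made up for illustration. The only point is that duplicated() on a plain data.frame accepts a list column and compares its elements, so rows that differ only in the list column are not treated as duplicates.

```r
# Toy data (hypothetical, not the dt from the question): a plain data.frame
# with a list column that holds data frames and a NULL.
df <- data.frame(
  word     = c("Inst", "Inst", "WIL", "Inst"),
  page_num = c(4, 4, 2, 4),
  stringsAsFactors = FALSE
)
df$matched <- list(
  data.frame(id = 1),
  data.frame(id = 1),  # same content as row 1, so row 2 is a duplicate
  NULL,
  data.frame(id = 2)   # same word/page_num as row 1, but different content
)

cols <- c("word", "page_num", "matched")

# On a plain data.frame, duplicated() dispatches to duplicated.data.frame
dup <- duplicated(df[cols])
dup
#> [1] FALSE  TRUE FALSE FALSE

# Keep only the first occurrence of each distinct row
dedup <- df[!dup, cols, drop = FALSE]
nrow(dedup)
#> [1] 3
```

Row 4 is kept even though its word and page_num match row 1, because its list-column entry differs. That is the same behaviour the explicit duplicated.data.frame call relies on when applied to the data.table.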