r - Why use secondary indices (`on`) in data.table other than for joins?



I'm a big fan of data.table, but I'm confused about the use of `on`.

Is

A) flights["JFK", on = "origin"]

any different from

B) flights["JFK" == origin]; # Or  flights["JFK" == get("origin")];

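What is the rationale for using the former (A) instead of the latter (B)? In other words, if one can already write dt[this == "that"], why introduce another form, dt["that", on = "this"], that does exactly the same thing? The vignette does not make the reason clear.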

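PS. I do understand why `on` was introduced for joins (as in dtA[dtB, on=.(A2=B2)]). I use it all the time and love it, because it makes the code shorter and easier to read, and it is fast too!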

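data.table has a lot of optimizations. Interestingly, if you switch the order to flights[origin == "JFK"], data.table will create an index when there are enough rows. Here are some options run with verbose = TRUE to help show what is happening: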

library(data.table)
flights <- fread("vignettes/flights14.csv")
## Using binary merge method explicitly
invisible(flights["JFK", on = "origin", verbose = TRUE])
## i.V1 has same type (character) as x.origin. No coercion needed.
## forder.c received 253316 rows and 11 columns
## Calculated ad hoc index in 0.010s elapsed (0.000s cpu) 
## Starting bmerge ...
## forder.c received 1 rows and 1 columns
## bmerge done in 0.000s elapsed (0.000s cpu) 
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 

## No optimization. Very quiet
invisible(flights["JFK" == origin, verbose = TRUE])

## Note an index is created and the message is similar to the first option
invisible(flights[origin == "JFK", verbose = TRUE])
## Creating new index 'origin'
## Creating index origin done in ... forder.c received 253316 rows and 11 columns
## forder took 0.09 sec
## 0.500s elapsed (0.570s cpu) 
## Optimized subsetting with index 'origin'
## forder.c received 1 rows and 1 columns
## forder took 0 sec
## x is already ordered by these columns, no need to call reorder
## i.origin has same type (character) as x.origin. No coercion needed.
## on= matches existing index, using index
## Starting bmerge ...
## bmerge done in 0.000s elapsed (0.000s cpu) 
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 

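So, to your question: why use the on = join approach? It uses a binary merge to find the matching rows and subset efficiently, which can be very fast when an index already exists. Also, on = does not automatically create an index, which may be what you want.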
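To make the index behaviour concrete, here is a minimal sketch on a made-up toy table (the data and the `flight_id` column are hypothetical, not taken from flights14.csv). It assumes `setindex()` and `indices()` behave as documented in current data.table releases.

```r
library(data.table)

# Toy stand-in for the flights table (hypothetical data)
dt <- data.table(
  origin    = c("JFK", "LGA", "EWR", "JFK", "LGA", "JFK"),
  flight_id = 1:6
)

# An on= subset computes its index ad hoc and does not store it on dt
res_adhoc  <- dt["JFK", on = "origin"]
idx_before <- indices(dt)   # nothing has been stored yet

# Build the secondary index once; later on= subsets can reuse it
setindex(dt, origin)
idx_after <- indices(dt)

# verbose = TRUE reports whether the existing index was picked up
res_indexed <- dt["JFK", on = "origin", verbose = TRUE]
```

If you subset on the same column repeatedly, paying for `setindex()` once and then using on = keeps you in control of when the index is built.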

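Related: dt["a", on = "ID"] is translated into dt[data.table(V1 = "a"), on = "ID"], with some extra processing to help with the names. In other words, it is just a user convenience for the more common dtA[dtB, on=.(A2=B2)] that you already love.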
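A small sketch of that equivalence, on hypothetical toy data. Because the name handling happens internally, the long form here spells out the column mapping as on = c(ID = "V1") rather than relying on on = "ID" resolving against a named table; that explicit mapping is my choice for the illustration, not something the answer above prescribes.

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(ID = c("a", "b", "a", "c"), val = 1:4)

# Shorthand: a character value in i plus on=
short_form <- dt["a", on = "ID"]

# Explicit join against a one-row table; c(ID = "V1") maps dt$ID to i_tbl$V1
i_tbl     <- data.table(V1 = "a")
long_form <- dt[i_tbl, on = c(ID = "V1")]
```

Both calls should return the same two rows of `dt`.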

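The reason to use dt[this == "that"] is that the code is very direct: anyone who knows R will see what is going on. In addition, for larger data.tables a new index is created automatically, which may be desirable. This is probably the form I will keep using.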
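As a quick check that the forms discussed here select the same rows, here is a sketch on hypothetical toy data. Whether the `==` form leaves an index behind on a table this small may depend on your data.table version and options, so the last line only inspects it.

```r
library(data.table)

# Hypothetical toy data
dt <- data.table(
  origin    = c("JFK", "LGA", "EWR", "JFK", "LGA", "JFK"),
  flight_id = 1:6
)

by_on          <- dt["JFK", on = "origin"]   # form A
by_eq          <- dt[origin == "JFK"]        # column on the left of ==
by_eq_reversed <- dt["JFK" == origin]        # form B from the question

# Inspect which indices, if any, are now stored on dt
print(indices(dt))
```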
