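I'm a big fan of data.table, but I'm confused about the use of the on argument. Is this
A) flights["JFK", on = "origin"]
any different from
B) flights["JFK" == origin]; # Or flights["JFK" == get("origin")];
What is the reason to use the former (A) instead of the latter (B)? In other words, if one can write dt[this == "that"], why introduce another form, dt["that", on = "this"], that does exactly the same thing? The vignette does not make the reason clear.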
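PS. I do understand why on was introduced for merges (as in dtA[dtB, on=.(A2=B2)]). I use it all the time and love it, because it makes the code shorter and easier to read, and it is fast too!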
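data.table has a lot of optimizations. Interestingly, if you switch the order to flights[origin == "JFK"], data.table will create an index when there are enough rows. Here are the three variants run with verbose = TRUE to help show what is going on: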
library(data.table)
flights <- fread("vignettes/flights14.csv")
## Using binary merge method explicitly
invisible(flights["JFK", on = "origin", verbose = TRUE])
## i.V1 has same type (character) as x.origin. No coercion needed.
## forder.c received 253316 rows and 11 columns
## Calculated ad hoc index in 0.010s elapsed (0.000s cpu)
## Starting bmerge ...
## forder.c received 1 rows and 1 columns
## bmerge done in 0.000s elapsed (0.000s cpu)
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
## No optimization. Very quiet
invisible(flights["JFK" == origin, verbose = TRUE])
## Note an index is created and the message is similar to the first option
invisible(flights[origin == "JFK", verbose = TRUE])
## Creating new index 'origin'
## Creating index origin done in ... forder.c received 253316 rows and 11 columns
## forder took 0.09 sec
## 0.500s elapsed (0.570s cpu)
## Optimized subsetting with index 'origin'
## forder.c received 1 rows and 1 columns
## forder took 0 sec
## x is already ordered by these columns, no need to call reorder
## i.origin has same type (character) as x.origin. No coercion needed.
## on= matches existing index, using index
## Starting bmerge ...
## bmerge done in 0.000s elapsed (0.000s cpu)
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
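So, to your question: why use the on = join method? It uses a binary merge to find the matches and subset efficiently, and it can be very fast when an index already exists. In addition, on = does not automatically create an index (the verbose output above shows only an ad hoc index being calculated for the call), which may be desirable.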
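The indexing difference described above can be checked with indices(). This is a minimal sketch on a hypothetical stand-in table (the data and the dep_delay values are made up, not the real flights14.csv); whether the == form stores an index depends on your data.table version and options such as datatable.auto.index, so that part is left for inspection rather than asserted.

```r
library(data.table)

# Hypothetical stand-in for the flights table (not the real flights14.csv data)
set.seed(1)
dt <- data.table(
  origin = sample(c("JFK", "LGA", "EWR"), 2e5, replace = TRUE),
  dep_delay = sample(-10:120, 2e5, replace = TRUE)
)

# Form A: join-style subset; the ordering is computed ad hoc and not stored
a <- dt["JFK", on = "origin"]
idx_after_join <- indices(dt)

# Form B: ordinary comparison syntax; data.table may auto-create and store an index
b <- dt[origin == "JFK"]
idx_after_scan <- indices(dt)  # inspect this to see whether an index was stored

# Both forms should select the same rows
same_rows <- isTRUE(all.equal(a, b, check.attributes = FALSE))

# An index can also be created explicitly so that later on = calls can reuse it
setindex(dt, origin)
idx_explicit <- indices(dt)
```

Printing idx_after_join, idx_after_scan and idx_explicit in turn shows at which point the table picked up a stored index.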
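Relatedly, dt["a", on = "ID"] is translated into dt[data.table(V1 = "a"), on = "ID"], with some extra processing to take care of the names. In other words, it is just a user convenience for the more general dtA[dtB, on=.(A2=B2)] that you already like.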
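To make the shorthand concrete, here is a small sketch comparing it with the spelled-out joins. The table, the column names ID and val, and the lookup table with its key_val column are all hypothetical; the explicit versions name the join column themselves, so they do not rely on the internal name handling mentioned above.

```r
library(data.table)

# Hypothetical toy table; the column names ID and val are illustrative only
dt <- data.table(ID = c("a", "b", "a", "c"), val = 1:4)

# Shorthand: a bare character vector in i, matched against the column named in on
short <- dt["a", on = "ID"]

# Spelled-out join: wrap the value in a one-column table and join on that column
long <- dt[data.table(ID = "a"), on = "ID"]

# The same join written in the dtA[dtB, on = .(A2 = B2)] style
lookup <- data.table(key_val = "a")
mapped <- dt[lookup, on = .(ID = key_val)]
```

All three calls return the rows of dt whose ID is "a", which is why the character form reads as a convenience wrapper around an ordinary join.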
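As for why you would use dt[this == "that"]: the code is very direct, and anyone who knows R will recognise what is going on. Also, for larger data.tables a new index is created automatically, which may be desirable. This is probably the form I will keep using.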