r - Why use secondary indices (`on`) in data.table, other than for joins?

I'm a big fan of data.table, but I'm confused about the use of on.

Is this

A) flights["JFK", on = "origin"]

any different from

B) flights["JFK" == origin]; # Or  flights["JFK" == get("origin")];

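What is the rationale for using the former (A) instead of the latter (B)? In other words, if one can write dt[this == "that"], why introduce another way, dt["that", on = "this"], that does exactly the same thing? The reason is not clear from the vignette.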

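PS. I do understand why on was introduced for merging tables (as in dtA[dtB, on=.(A2=B2)]). I use it all the time and love it, because it makes the code shorter and easier to read, and it is fast too!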

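data.table has a lot of optimizations. Interestingly, if you switch the order to flights[origin == "JFK"], data.table will create an index when there are enough rows. Here are the options run with verbose = TRUE, to help see what is happening: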

library(data.table)
flights <- fread("vignettes/flights14.csv")
## Using binary merge method explicitly
invisible(flights["JFK", on = "origin", verbose = TRUE])
## i.V1 has same type (character) as x.origin. No coercion needed.
## forder.c received 253316 rows and 11 columns
## Calculated ad hoc index in 0.010s elapsed (0.000s cpu) 
## Starting bmerge ...
## forder.c received 1 rows and 1 columns
## bmerge done in 0.000s elapsed (0.000s cpu) 
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu) 

## No optimization. Very quiet
invisible(flights["JFK" == origin, verbose = TRUE])

## Note an index is created and the message is similar to the first option
invisible(flights[origin == "JFK", verbose = TRUE])
## Creating new index 'origin'
## Creating index origin done in ... forder.c received 253316 rows and 11 columns
## forder took 0.09 sec
## 0.500s elapsed (0.570s cpu) 
## Optimized subsetting with index 'origin'
## forder.c received 1 rows and 1 columns
## forder took 0 sec
## x is already ordered by these columns, no need to call reorder
## i.origin has same type (character) as x.origin. No coercion needed.
## on= matches existing index, using index
## Starting bmerge ...
## bmerge done in 0.000s elapsed (0.000s cpu) 
## Constructing irows for '!byjoin || nqbyjoin' ... 0.000s elapsed (0.000s cpu)
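
If flights14.csv is not at hand, the comparison can be reproduced on a small table. This is only a sketch: flights_toy and its columns are made up for illustration, and on a table this small the verbose output will not necessarily match the log above. It just checks that the three forms select the same rows.

```r
library(data.table)

# Hypothetical stand-in for the flights data (not the real flights14.csv)
flights_toy <- data.table(
  id     = 1:6,
  origin = c("JFK", "LGA", "JFK", "EWR", "LGA", "JFK"),
  dest   = c("LAX", "ORD", "SFO", "MIA", "BOS", "SEA")
)

res_on  <- flights_toy["JFK", on = "origin"]  # join-style subset (binary merge)
res_rev <- flights_toy["JFK" == origin]       # plain logical subset
res_eq  <- flights_toy[origin == "JFK"]       # logical subset, candidate for auto-indexing

# Compare the selected row ids, ignoring row order
same_rows <- identical(sort(res_on$id), sort(res_rev$id)) &&
  identical(sort(res_on$id), sort(res_eq$id))
print(same_rows)
```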

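So, to your question: why use the on = join method? It uses a binary merge to find the matches and subset efficiently, which can be very fast when an index already exists. In addition, on = does not automatically create an index, which may be what you want.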
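
To see the "existing index" case, one option is to create the index explicitly with setindex() and inspect it with indices(). The table below is a made-up toy example, not the flights data, and the exact verbose messages may vary by data.table version.

```r
library(data.table)

# Toy table; names and values are illustrative only
dt <- data.table(
  origin = c("JFK", "LGA", "JFK", "EWR"),
  delay  = c(5L, -2L, 12L, 0L)
)

before <- indices(dt)      # NULL: a fresh table carries no index

setindex(dt, origin)       # build and store a secondary index on 'origin'
after <- indices(dt)       # "origin"

# With the index in place, the on = subset can reuse it;
# verbose = TRUE reports which path was taken.
res <- dt["JFK", on = "origin", verbose = TRUE]
print(res)
```

Unlike setkey(), setindex() does not physically reorder the rows, so the table keeps its original order.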

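Relatedly, dt["a", on = "ID"] is translated into dt[data.table(V1 = "a"), on = "ID"], with some extra processing to help with the names. In other words, it is just a user convenience for the more common dtA[dtB, on=.(A2=B2)] that you already like.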
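
Because the character shortcut is a join, join semantics apply. The sketch below uses a made-up table and spells the lookup table out with a column named ID (an assumption made here so that on = "ID" resolves on both sides when written by hand). It also shows one practical difference from a logical filter: by default an unmatched value comes back as a row of NA rather than no row.

```r
library(data.table)

# Toy table; names and values are illustrative only
dt <- data.table(ID = c("a", "b", "a", "c"), val = 1:4)

short <- dt["a", on = "ID"]                    # shortcut form
long  <- dt[data.table(ID = "a"), on = "ID"]   # explicit join form

# A value with no match: join vs. filter
miss_join   <- dt["z", on = "ID"]                  # 1 row, val is NA (default nomatch = NA)
miss_inner  <- dt["z", on = "ID", nomatch = NULL]  # 0 rows
miss_filter <- dt[ID == "z"]                       # 0 rows

print(nrow(miss_join))
print(nrow(miss_filter))
```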

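As for why to use dt[this == "that"]: the code is very direct, and anyone who knows R will recognize what is happening. In addition, for larger data.tables a new index is created automatically, which may be desirable. This is probably the form I will keep using.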
