我写了一个代码:
df_all<- df1 %>%
mutate(type = factor(type, levels = df3$type)) %>%
group_by(ID, date) %>%
complete(type, fill = list(value = 0)) %>%
left_join(df3)
如何将其重写为使用data.table?我是新来的,所以我不知道?如果你能帮我做这件事就太好了。
这是我使用它的原因:
我有一个具有列类型的单列数据帧df3;"类型":
comment type
used enter
used open
used close
used update
not_used delete
我从数据库中获取数据帧。但在该数据帧中;"类型";可能是。下面是该表的示例:
ID date type value
a1 2020-09-01 enter 18
a1 2020-09-01 close 15
a1 2020-09-02 enter 4
a2 2020-09-01 close 10
b1 2020-09-02 update 10
正如您所看到的,ID a1只有两种类型:输入和关闭。a2只有关闭,b1只有更新。
我想用那种方式把这两张表绑定起来,所以"类型";不在我的表中的每个ID和日期的值为零。那么,如何绑定这两个表来获得这个:
comment ID date type value
used a1 2020-09-01 enter 18
used a1 2020-09-01 open 0
used a1 2020-09-01 close 15
used a1 2020-09-01 update 0
not_used a1 2020-09-01 delete 0
used a1 2020-09-02 enter 4
used a1 2020-09-02 open 0
used a1 2020-09-02 close 0
used a1 2020-09-02 update 0
not_used a1 2020-09-02 delete 0
used a2 2020-09-01 enter 0
used a2 2020-09-01 open 0
used a2 2020-09-01 close 10
used a2 2020-09-01 update 0
not_used a2 2020-09-01 delete 0
used b1 2020-09-01 enter 0
used b1 2020-09-01 open 0
used b1 2020-09-01 close 0
used b1 2020-09-01 update 10
not_used b1 2020-09-01 delete 0
正如你所看到的,我还保留了专栏";评论";。如何将代码重写为data.table?
以下是OP代码的简明data.table
版本:
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)]
返回预期结果(为了清楚起见,在第二步中转换value
列中的NA
s-见下文(
ID date type value comment 1: a1 2020-09-01 enter 18 used 2: a1 2020-09-01 open NA used 3: a1 2020-09-01 close 15 used 4: a1 2020-09-01 update NA used 5: a1 2020-09-01 delete NA not_used 6: a1 2020-09-02 enter 4 used 7: a1 2020-09-02 open NA used 8: a1 2020-09-02 close NA used 9: a1 2020-09-02 update NA used 10: a1 2020-09-02 delete NA not_used 11: a2 2020-09-01 enter NA used 12: a2 2020-09-01 open NA used 13: a2 2020-09-01 close 10 used 14: a2 2020-09-01 update NA used 15: a2 2020-09-01 delete NA not_used 16: b1 2020-09-02 enter NA used 17: b1 2020-09-02 open NA used 18: b1 2020-09-02 close NA used 19: b1 2020-09-02 update 10 used 20: b1 2020-09-02 delete NA not_used
对于每一组唯一的ID
、date
组合,df1
的行子集与type
上的df3
右连接,这完成了每个子集的缺失行。因为使用了right join而不是tidyr::complete((,所以这里不需要强制type
对所有因子级别进行因子化。此外,data.table
在联接期间保留了df3
行的顺序。
对于转换value
列中的NA
s,有4种不同的方法可用,它们都返回相同的结果:
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][is.na(value), value := 0L][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := fcoalesce(value, 0L)][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := nafill(value, fill = 0L)][]
setnafill(setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)], fill = 0L, cols = "value")[]
ID date type value comment 1: a1 2020-09-01 enter 18 used 2: a1 2020-09-01 open 0 used 3: a1 2020-09-01 close 15 used 4: a1 2020-09-01 update 0 used 5: a1 2020-09-01 delete 0 not_used 6: a1 2020-09-02 enter 4 used 7: a1 2020-09-02 open 0 used 8: a1 2020-09-02 close 0 used 9: a1 2020-09-02 update 0 used 10: a1 2020-09-02 delete 0 not_used 11: a2 2020-09-01 enter 0 used 12: a2 2020-09-01 open 0 used 13: a2 2020-09-01 close 10 used 14: a2 2020-09-01 update 0 used 15: a2 2020-09-01 delete 0 not_used 16: b1 2020-09-02 enter 0 used 17: b1 2020-09-02 open 0 used 18: b1 2020-09-02 close 0 used 19: b1 2020-09-02 update 10 used 20: b1 2020-09-02 delete 0 not_used
我们可以转换factor
的type
,使用CJ
(交叉联接(按ID、日期和类型展开
library(data.table)
setDT(df1)[, type := factor(type, levels = unique(df3$type))][,
CJ(ID, date, type = type, unique = TRUE)][df1,
value := value, on = .(ID, date, type)][is.na(value),
value := 0][df3, on = .(type)]
或者可以使用split
setDT(df1)[, type := factor(type, levels = unique(df3$type))]
rbindlist(lapply(split(df1, df1[, .(ID, date)], drop = TRUE),
function(x) x[, CJ(ID, date, type = levels(x$type), unique = TRUE)][x,
value := value, on = .(ID, date, type)][is.na(value), value := 0][]))[df3, on = .(type)]))
-输出
# ID date type value comment
# 1: a1 2020-09-01 enter 18 used
# 2: a1 2020-09-01 open 0 used
# 3: a1 2020-09-01 close 15 used
# 4: a1 2020-09-01 update 0 used
# 5: a1 2020-09-01 delete 0 not_used
# 6: a2 2020-09-01 enter 0 used
# 7: a2 2020-09-01 open 0 used
# 8: a2 2020-09-01 close 10 used
# 9: a2 2020-09-01 update 0 used
#10: a2 2020-09-01 delete 0 not_used
#11: a1 2020-09-02 enter 4 used
#12: a1 2020-09-02 open 0 used
#13: a1 2020-09-02 close 0 used
#14: a1 2020-09-02 update 0 used
#15: a1 2020-09-02 delete 0 not_used
#16: b1 2020-09-02 enter 0 used
#17: b1 2020-09-02 open 0 used
#18: b1 2020-09-02 close 0 used
#19: b1 2020-09-02 update 10 used
#20: b1 2020-09-02 delete 0 not_used