r-如何将我的代码从使用tidyverse重写为data.table



我写了一个代码:

df_all<- df1 %>%
mutate(type = factor(type, levels = df3$type)) %>%
group_by(ID, date) %>%
complete(type, fill = list(value = 0)) %>%
left_join(df3)

如何将其重写为使用data.table?我是新来的,所以我不知道?如果你能帮我做这件事就太好了。

这是我使用它的原因:

我有一个具有列类型的单列数据帧df3;"类型":

comment       type
used         enter 
used         open
used         close
used         update
not_used     delete

我从数据库中获取数据帧。但在该数据帧中;"类型";可能是。下面是该表的示例:

ID    date            type           value
a1    2020-09-01       enter          18
a1    2020-09-01       close          15
a1    2020-09-02       enter          4
a2    2020-09-01       close          10
b1    2020-09-02       update         10

正如您所看到的,ID a1只有两种类型:输入和关闭。a2只有关闭,b1只有更新。

我想用那种方式把这两张表绑定起来,所以"类型";不在我的表中的每个ID和日期的值为零。那么,如何绑定这两个表来获得这个:

comment            ID    date            type           value
used               a1    2020-09-01       enter          18
used               a1    2020-09-01       open           0
used               a1    2020-09-01       close          15
used               a1    2020-09-01       update         0
not_used           a1    2020-09-01       delete         0
used               a1    2020-09-02       enter          4
used               a1    2020-09-02       open           0
used               a1    2020-09-02       close          0
used               a1    2020-09-02       update         0
not_used           a1    2020-09-02       delete         0
used               a2    2020-09-01       enter          0
used               a2    2020-09-01       open           0
used               a2    2020-09-01       close          10
used               a2    2020-09-01       update         0
not_used           a2    2020-09-01       delete         0
used               b1    2020-09-01       enter          0
used               b1    2020-09-01       open           0
used               b1    2020-09-01       close          0
used               b1    2020-09-01       update         10
not_used           b1    2020-09-01       delete         0

正如你所看到的,我还保留了专栏";评论";。如何将代码重写为data.table?

以下是OP代码的简明data.table版本:

setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)]

返回预期结果(为了清楚起见,在第二步中转换value列中的NAs-见下文(

ID       date   type value  comment
1: a1 2020-09-01  enter    18     used
2: a1 2020-09-01   open    NA     used
3: a1 2020-09-01  close    15     used
4: a1 2020-09-01 update    NA     used
5: a1 2020-09-01 delete    NA not_used
6: a1 2020-09-02  enter     4     used
7: a1 2020-09-02   open    NA     used
8: a1 2020-09-02  close    NA     used
9: a1 2020-09-02 update    NA     used
10: a1 2020-09-02 delete    NA not_used
11: a2 2020-09-01  enter    NA     used
12: a2 2020-09-01   open    NA     used
13: a2 2020-09-01  close    10     used
14: a2 2020-09-01 update    NA     used
15: a2 2020-09-01 delete    NA not_used
16: b1 2020-09-02  enter    NA     used
17: b1 2020-09-02   open    NA     used
18: b1 2020-09-02  close    NA     used
19: b1 2020-09-02 update    10     used
20: b1 2020-09-02 delete    NA not_used

对于每一组唯一的IDdate组合,df1的行子集与type上的df3右连接,这完成了每个子集的缺失行。因为使用了right join而不是tidyr::complete((,所以这里不需要强制type对所有因子级别进行因子化。此外,data.table在联接期间保留了df3行的顺序。

对于转换value列中的NAs,有4种不同的方法可用,它们都返回相同的结果:

setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][is.na(value), value := 0L][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := fcoalesce(value, 0L)][]
setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)][, value := nafill(value, fill = 0L)][]
setnafill(setDT(df1)[, .SD[df3, on = .(type)], by = .(ID, date)], fill = 0L, cols = "value")[]
ID       date   type value  comment
1: a1 2020-09-01  enter    18     used
2: a1 2020-09-01   open     0     used
3: a1 2020-09-01  close    15     used
4: a1 2020-09-01 update     0     used
5: a1 2020-09-01 delete     0 not_used
6: a1 2020-09-02  enter     4     used
7: a1 2020-09-02   open     0     used
8: a1 2020-09-02  close     0     used
9: a1 2020-09-02 update     0     used
10: a1 2020-09-02 delete     0 not_used
11: a2 2020-09-01  enter     0     used
12: a2 2020-09-01   open     0     used
13: a2 2020-09-01  close    10     used
14: a2 2020-09-01 update     0     used
15: a2 2020-09-01 delete     0 not_used
16: b1 2020-09-02  enter     0     used
17: b1 2020-09-02   open     0     used
18: b1 2020-09-02  close     0     used
19: b1 2020-09-02 update    10     used
20: b1 2020-09-02 delete     0 not_used

我们可以转换factortype,使用CJ(交叉联接(按ID、日期和类型展开

library(data.table)
setDT(df1)[, type := factor(type, levels = unique(df3$type))][,
CJ(ID, date, type = type, unique = TRUE)][df1,
value := value, on = .(ID, date, type)][is.na(value),
value := 0][df3, on = .(type)]

或者可以使用split

setDT(df1)[, type := factor(type, levels = unique(df3$type))]
rbindlist(lapply(split(df1, df1[, .(ID, date)], drop = TRUE), 
function(x) x[, CJ(ID, date, type = levels(x$type), unique = TRUE)][x, 
value := value, on = .(ID, date, type)][is.na(value), value := 0][]))[df3, on = .(type)]))

-输出

#    ID       date   type value  comment
# 1: a1 2020-09-01  enter    18     used
# 2: a1 2020-09-01   open     0     used
# 3: a1 2020-09-01  close    15     used
# 4: a1 2020-09-01 update     0     used
# 5: a1 2020-09-01 delete     0 not_used
# 6: a2 2020-09-01  enter     0     used
# 7: a2 2020-09-01   open     0     used
# 8: a2 2020-09-01  close    10     used
# 9: a2 2020-09-01 update     0     used
#10: a2 2020-09-01 delete     0 not_used
#11: a1 2020-09-02  enter     4     used
#12: a1 2020-09-02   open     0     used
#13: a1 2020-09-02  close     0     used
#14: a1 2020-09-02 update     0     used
#15: a1 2020-09-02 delete     0 not_used
#16: b1 2020-09-02  enter     0     used
#17: b1 2020-09-02   open     0     used
#18: b1 2020-09-02  close     0     used
#19: b1 2020-09-02 update    10     used
#20: b1 2020-09-02 delete     0 not_used

最新更新