R - Computing a new column per factor level from two data.frames/data.tables



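I am trying to compute the values of a new column for the data.table dt. Part of the calculation comes from the data.frame df (it could also be a data.table; so far I have not needed that).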

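How can I use values from two different objects to compute a new column wherever the factor level (here: sample) matches? I used to merge the two objects and work row by row, but that produces a lot of redundant data.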

Here is the data.frame, which has only 10 rows:

df
    sample scaling_factor
A1      A1      111956565
A2      A2       89869320
A3      A3      120925219
A4      A4      111757559
A5      A5       77319341
A6      A6       89403194
A7      A7      150214981
B8      B8      133885925
B9      B9       86536587
B10    B10      123574939

df <- structure(list(sample = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 
9L, 10L, 8L), .Label = c("A1", "A2", "A3", "A4", "A5", "A6", 
"A7", "B10", "B8", "B9"), class = "factor"), scaling_factor = c(111956565.427018, 
89869319.9348599, 120925219.4453, 111757558.886234, 77319340.5841949, 
89403194.1170576, 150214980.784589, 133885925.080984, 86536586.7136393, 
123574939.026597)), .Names = c("sample", "scaling_factor"), class = "data.frame", row.names = c("A1", 
"A2", "A3", "A4", "A5", "A6", "A7", "B8", "B9", "B10"))

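Here is the data.table, which has hundreds of thousands of rows per sample (dput ran into problems with < in its output, so it is not provided here):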

setDT(dt)
    sample     contig_id product_reads_rpk
 1:     A1     contig_10        2000.00000
 2:     A1    contig_100          24.27184
 3:     A1   contig_1000        1713.90374
 4:     A1  contig_10000        2900.66225
 5:     A1 contig_100003        1713.94231
 6:     A1 contig_100004        8575.23511
 7:     A1 contig_100004       11059.32203
 8:     A2 contig_100009        6923.67400
 9:     A2 contig_100010        1285.30259
10:     A2 contig_100015          84.74576

dt[, product_rpm := product_reads_rpk / (df$scaling_factor / 1000000), by = sample]

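I am trying to generate a new column product_rpm in dt based on the corresponding value for each sample in df. How can I do this? I get "longer object length is not a multiple of shorter object length", but the shorter object has length 1, e.g. A1 in df, right?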

I don't know of a way to do this without actually merging the two datasets, but if you merge them the data.table way, you can avoid creating redundant columns.

So in your case it is simply:

df <- as.data.table(df)
# Update join: add product_rpm to dt by reference, matching rows on "sample"
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1000000), on = "sample"]
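As a rough illustration of the update join above, here is a minimal sketch on a hand-made subset of the question's data. The three rows of dt and the two rounded scaling factors are assumptions chosen for illustration, not the real dataset. The attempt in the question most likely warns because df$scaling_factor is always the full 10-element vector, since by = sample only groups dt and never subsets df.

```r
library(data.table)

# Toy subset of the question's data (rounded values, for illustration only)
df <- data.table(sample = c("A1", "A2"),
                 scaling_factor = c(111956565, 89869320))
dt <- data.table(sample = c("A1", "A1", "A2"),
                 contig_id = c("contig_10", "contig_100", "contig_100009"),
                 product_reads_rpk = c(2000, 24.27184, 6923.674))

# Update join: for every row of dt whose sample matches a row of df,
# compute product_rpm from dt's product_reads_rpk and df's scaling_factor.
dt[df, product_rpm := product_reads_rpk / (scaling_factor / 1e6), on = "sample"]

# dt gains exactly one column; scaling_factor itself is not copied into dt
print(names(dt))
print(dt)
```

Because := works by reference, dt is modified in place and no merged copy of the large table is created.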

A simple example:

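library(data.table)

# dt1: one row per id, holding the per-id divisor
dt1 <- data.table(id = sample(1000:9999, size = 100),
                  size = sample(10000:99999, size = 100))
# dt2: ten rows per id
dt2 <- data.table(id = rep(dt1$id, 10),
                  group = rep(LETTERS[1:5], 200),
                  value = sample(1000:9999, size = 100 * 10, replace = TRUE))
# Update join: := adds metric to dt2 by reference, so dt3 refers to the same table as dt2
dt3 <- dt2[dt1, metric := (value / size), on = "id"]
head(dt3)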
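To make the point about redundant columns concrete, the following sketch compares an explicit merge with the update join on a small deterministic table. The tables and their values are made up for illustration. It suggests that both routes yield the same metric, while only the merge carries size along.

```r
library(data.table)

# Hypothetical small tables: one divisor per id, two values per id
dt1 <- data.table(id = 1:5, size = c(10, 20, 40, 50, 100))
dt2 <- data.table(id = rep(1:5, each = 2), value = 1:10 * 100)

# Explicit merge: the result carries the size column on every row
merged <- merge(dt2, dt1, by = "id")
merged[, metric := value / size]

# Update join: only metric is added to dt2
dt2[dt1, metric := value / size, on = "id"]

print(names(merged))  # includes size
print(names(dt2))     # no size column
```

With hundreds of thousands of rows per sample, skipping the extra column and the merged copy is where the memory saving comes from.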
