I have a data.frame with values of 100 ids (e.g. genes) measured in 10 groups (e.g. cell types), where each group comes from 10 families (e.g. tissues), and each id has 3 samples per group-family combination, i.e. 30,000 rows in total:
set.seed(1)
df <- data.frame(id = rep(paste0("i", 1:100), 300),
                 group = rep(unlist(lapply(1:10, function(g) rep(paste0("g", g), 100))), 30),
                 family = unlist(lapply(1:10, function(f) rep(paste0("f", f), 3000))),
                 val = rnorm(30000))
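I want to create a data.frame in which, for each id in each group in each family, I compute the fold change between its mean val and the mean val of all other ids in that group and family.
This is what I'm doing now, but I'm looking for a faster implementation, which is probably achievable with dplyr: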
ids <- paste0("i",1:100)
groups <- paste0("g",1:10)
families <- paste0("f",1:10)
res.df <- do.call(rbind, lapply(ids, function(i) {
  do.call(rbind, lapply(families, function(f) {
    do.call(rbind, lapply(groups, function(g) {
      data.frame(
        id = i, group = g, family = f,
        fc = mean(dplyr::filter(df, id == i, group == g, family == f)$val) /
          mean(dplyr::filter(df, id != i, group == g, family == f)$val)
      )
    }))
  }))
}))
Any ideas?
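I agree with @PoGibas that your question lacks clarity, so I am simply assuming that you are trying to reproduce the final data frame res.df efficiently. On that point, I believe @PoGibas's answer does not give the format you want, and some people may find data.table syntax harder to follow than dplyr (I am not trying to compare them; both packages have their own strengths). Here is a possible dplyr solution: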
library(dplyr)
# assuming that df and res.df are already loaded as given in the question
by_id_group_family <- df %>%
  # group by id, group and family
  group_by(id, group, family) %>%
  # calculate some useful features of the grouped data
  summarise(
    count = n(),
    total_val = sum(val),
    avg_val = mean(val)
  )
by_group_family <- df %>%
  # group by group and family
  group_by(group, family) %>%
  # calculate some useful features of the grouped data
  summarise(
    count = n(),
    total_val = sum(val),
    avg_val = mean(val)
  )
# store mean vals for each id samples in each group in each family
mean_ids <- by_id_group_family$avg_val
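# compute mean vals of all other ids in each group in each family
# note: the shorter vector (100 group/family totals) is recycled across the 10000 id rows;
# this lines up only because both summaries are sorted by group and then family
# the leading minus turns (id total - group/family total) into (group/family total - id total)
# 297 = 300 values per group/family minus the 3 samples of the id itself
mean_other_ids <- -(by_id_group_family$total_val - by_group_family$total_val) / 297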
# computing the ratio of means
ratio <- mean_ids / mean_other_ids
# combining the ratio with the grouped data
result <- by_id_group_family %>%
  # choose only the first three columns
  select(1:3) %>%
  ungroup() %>%
  # add a new column to store ratio
  mutate(fc = ratio)
# note that result has the same info as your res.df but family column is sorted differently
head(result)
# # A tibble: 6 x 4
# id group family fc
# <fct> <fct> <fct> <dbl>
# 1 i1 g1 f1 9.48
# 2 i1 g1 f10 -4.86
# 3 i1 g1 f2 -50.4
# 4 i1 g1 f3 17.2
# 5 i1 g1 f4 131.
# 6 i1 g1 f5 4.03
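The code could be made more concise by combining some steps and dropping the extra computations, but I think this way is easier to follow, and the extra summary statistics help in understanding the nature of the data.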
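A more compact variant might look like the sketch below. It is only an illustration, assuming dplyr 1.0.0 or later (for the .groups argument); the names result_compact, gf_total, gf_n, id_total and id_n are made up for this example. Instead of relying on vector recycling, it carries the group/family total along with each row, so the mean of all other ids is (group/family total - id total) / (group/family count - id count).

```r
library(dplyr)

# data as defined in the question
set.seed(1)
df <- data.frame(
  id = rep(paste0("i", 1:100), 300),
  group = rep(unlist(lapply(1:10, function(g) rep(paste0("g", g), 100))), 30),
  family = unlist(lapply(1:10, function(f) rep(paste0("f", f), 3000))),
  val = rnorm(30000)
)

result_compact <- df %>%
  # attach the total and the row count of each group/family to every row
  group_by(group, family) %>%
  mutate(gf_total = sum(val), gf_n = n()) %>%
  # collapse to one row per id within each group/family
  group_by(id, group, family) %>%
  summarise(
    id_total = sum(val),
    id_n = n(),
    gf_total = first(gf_total),
    gf_n = first(gf_n),
    .groups = "drop"
  ) %>%
  # mean of this id divided by the mean of all other ids
  mutate(fc = (id_total / id_n) / ((gf_total - id_total) / (gf_n - id_n))) %>%
  select(id, group, family, fc)

head(result_compact)
```

Because the group/family totals travel with the rows, this version does not depend on how the two summaries happen to be sorted, and it would still work if the number of samples per id were not always 3.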
Short answer:
library(data.table)
dfM <- setDT(df)[, mean(val), .(id, group, family)]
cbind(dfM[, outer(V1, V1, "/"), .(group, family)],
dfM[, expand.grid(id, id), .(group, family)][, .(Var1, Var2)])
Explanation:
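I would solve this task differently (without iteration). First, we have to clarify your problem:
- Calculate the mean of val for every id, group and family combination
- Divide each of these means by the other means within each group and family combination
To calculate the means I will use data.table (I also use data.table for the later per-group calculations); the idea is to avoid recalculating the means multiple times.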
library(data.table)
dfM <- setDT(df)[, mean(val), .(id, group, family)]
# Result
# head(dfM)
# id group family V1
# 1: i1 g1 f1 -0.12587944
# 2: i2 g1 f1 -0.20889324
# 3: i3 g1 f1 -0.02890183
# 4: i4 g1 f1 0.77509410
# 5: i5 g1 f1 0.11435116
# 6: i6 g1 f1 -0.59556654
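To calculate the fold change (i.e. a vector division) we can use the outer function. Here we divide vector V1 by vector V1 within the data.table dfM, separately for each group and family combination.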
foo <- dfM[, outer(V1, V1, "/"), .(group, family)]
# nrow(foo)
# 1000000
# group family V1
# 1: g1 f1 1.0000000
# 2: g1 f1 1.6594708
# 3: g1 f1 0.2295993
# 4: g1 f1 -6.1574322
# 5: g1 f1 -0.9084181
# 6: g1 f1 4.7312457
outer gives us no information about the ids, so for that we use another base R function, expand.grid.
bar <- dfM[, expand.grid(id, id), .(group, family)][, .(id1 = Var1, id2 = Var2)]
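To see why the rows of the outer result and the rows of expand.grid line up, here is a small base R sketch on a toy vector (the names v, ratios and pairs are made up for this illustration). outer(v, v, "/") returns a matrix whose entry [i, j] is v[i] / v[j]; flattening it goes column by column, which is the same order in which expand.grid varies its first argument fastest.

```r
# toy means for three hypothetical ids
v <- c(a = 2, b = 4, c = 8)

# matrix of all pairwise ratios, flattened column by column
ratios <- as.vector(outer(v, v, "/"))

# all id pairs; the first column varies fastest, matching the flattened matrix
pairs <- expand.grid(id1 = names(v), id2 = names(v), stringsAsFactors = FALSE)
pairs$ratio <- ratios

# each row now holds mean(id1) / mean(id2)
pairs
```

So in each row of the combined table, the ratio is the mean of id1 divided by the mean of id2.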
For the final result, use cbind:
head(cbind(foo, bar))
# group family V1 id1 id2
# 1: g1 f1 1.0000000 i1 i1
# 2: g1 f1 1.6594708 i2 i1
# 3: g1 f1 0.2295993 i3 i1
# 4: g1 f1 -6.1574322 i4 i1
# 5: g1 f1 -0.9084181 i5 i1
# 6: g1 f1 4.7312457 i6 i1
For the data given by the OP, this solution takes only a few seconds.
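The table above holds every pairwise ratio between ids. If what is needed is instead the ratio of each id's mean to the pooled mean of all other ids in the same group and family (the layout of res.df in the question), a data.table sketch along the same lines might look like this. It is only an illustration; the names dt, sums, id_total, id_n and res are made up for this example.

```r
library(data.table)

# data as defined in the question
set.seed(1)
df <- data.frame(
  id = rep(paste0("i", 1:100), 300),
  group = rep(unlist(lapply(1:10, function(g) rep(paste0("g", g), 100))), 30),
  family = unlist(lapply(1:10, function(f) rep(paste0("f", f), 3000))),
  val = rnorm(30000)
)

# as.data.table() makes a copy, so df itself is left unchanged
dt <- as.data.table(df)

# one row per id/group/family with the sum and the number of samples
sums <- dt[, .(id_total = sum(val), id_n = .N), by = .(id, group, family)]

# within each group/family: mean of the id divided by the mean of all other ids
sums[, fc := (id_total / id_n) /
         ((sum(id_total) - id_total) / (sum(id_n) - id_n)),
     by = .(group, family)]

res <- sums[, .(id, group, family, fc)]
head(res)
```

This gives 10,000 rows (one per id, group and family) rather than the 1,000,000 pairwise rows.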
Data:
set.seed(1)
df <- data.frame(id = rep(paste0("i", 1:100), 300),
                 group = rep(unlist(lapply(1:10, function(g) rep(paste0("g", g), 100))), 30),
                 family = unlist(lapply(1:10, function(f) rep(paste0("f", f), 3000))),
                 val = rnorm(30000))