r-关联问题导致超过2^31行



我有两个巨大的数据帧:

> dim(res)
[1] 111478253         8
> dim(asign)
[1] 107371528         5

我想通过"chr"one_answers"pos"合并它们

> head(res)
chr   pos a1 a2  a3         variant_id pval_nominal           gene_id
1: chr1 54490  G  A b38 chr1_54490_G_A_b38     0.608495 ENSG00000227232.5
2: chr1 58814  G  A b38 chr1_58814_G_A_b38     0.295211 ENSG00000227232.5
3: chr1 60351  A  G b38 chr1_60351_A_G_b38     0.439788 ENSG00000227232.5
4: chr1 61920  G  A b38 chr1_61920_G_A_b38     0.319528 ENSG00000227232.5
5: chr1 63671  G  A b38 chr1_63671_G_A_b38     0.237739 ENSG00000227232.5
6: chr1 64931  G  A b38 chr1_64931_G_A_b38     0.276679 ENSG00000227232.5
> head(asign)
gene  chr                chr_pos   pos p.val.Retina
1: ENSG00000227232 chr1           1:10177:A:AC 10177     0.381708
2: ENSG00000227232 chr1 rs145072688:10352:T:TA 10352     0.959523
3: ENSG00000227232 chr1            1:11008:C:G 11008     0.218132
4: ENSG00000227232 chr1            1:11012:C:G 11012     0.218132
5: ENSG00000227232 chr1            1:13110:G:A 13110     0.998262
6: ENSG00000227232 chr1  rs201725126:13116:T:G 13116     0.438572
> m=merge(res, asign, by = c("chr", "pos"))
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||    !anyDuplicated(f__,  : 
Join results in more than 2^31 rows (internal vecseq reached   physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

我尝试使用by=.EACHI,但得到了相同的错误。

I最终合并的文件我只需要保持匹配:"chr","pos","pval_normal","p.val.Retina">

我只需要"res"one_answers"asign"数据帧之间的公共行。

我可以从这两个数据帧中删除我不需要的列,我得到了这个:

> head(asignx)
chr   pos     p.val.Retina
1: chr1 10177     0.381708
2: chr1 10352     0.959523
3: chr1 11008     0.218132
4: chr1 11012     0.218132
5: chr1 13110     0.998262
6: chr1 13116     0.438572
> head(l4x)
chr   pos     pval_nominal
1: chr1 13550     0.375614
2: chr1 14671     0.474708
3: chr1 14677     0.699887
4: chr1 16841     0.127895
5: chr1 16856     0.627822
6: chr1 17005     0.802803

但当我尝试合并这些时:

> m=merge(l4x,asignx, by = c("chr", "pos"),all.x=FALSE,all.y=FALSE)
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||   !anyDuplicated(f__,  : 
Join results in more than 2^31 rows (internal vecseq reached physical    limit)

假设您的两个数据帧都加载到一个数据库中(您必须设置一个类似postgres或sql server的数据库(,则sql等效于:

m=merge(res, asign, by = c("chr", "pos"))

是否

select *
into m_table
from res 
join asign
on res.chr = asign.chr
and res.pos = asign.pos

然后你会有一个新的表格:

select *
from m_table;

最新更新