我有两个巨大的数据帧:
> dim(res)
[1] 111478253 8
> dim(asign)
[1] 107371528 5
我想通过"chr"one_answers"pos"合并它们
> head(res)
chr pos a1 a2 a3 variant_id pval_nominal gene_id
1: chr1 54490 G A b38 chr1_54490_G_A_b38 0.608495 ENSG00000227232.5
2: chr1 58814 G A b38 chr1_58814_G_A_b38 0.295211 ENSG00000227232.5
3: chr1 60351 A G b38 chr1_60351_A_G_b38 0.439788 ENSG00000227232.5
4: chr1 61920 G A b38 chr1_61920_G_A_b38 0.319528 ENSG00000227232.5
5: chr1 63671 G A b38 chr1_63671_G_A_b38 0.237739 ENSG00000227232.5
6: chr1 64931 G A b38 chr1_64931_G_A_b38 0.276679 ENSG00000227232.5
> head(asign)
gene chr chr_pos pos p.val.Retina
1: ENSG00000227232 chr1 1:10177:A:AC 10177 0.381708
2: ENSG00000227232 chr1 rs145072688:10352:T:TA 10352 0.959523
3: ENSG00000227232 chr1 1:11008:C:G 11008 0.218132
4: ENSG00000227232 chr1 1:11012:C:G 11012 0.218132
5: ENSG00000227232 chr1 1:13110:G:A 13110 0.998262
6: ENSG00000227232 chr1 rs201725126:13116:T:G 13116 0.438572
> m=merge(res, asign, by = c("chr", "pos"))
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join. Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.
我尝试使用by=.EACHI,但得到了相同的错误。
I最终合并的文件我只需要保持匹配:"chr","pos","pval_normal","p.val.Retina">
我只需要"res"one_answers"asign"数据帧之间的公共行。
我可以从这两个数据帧中删除我不需要的列,我得到了这个:
> head(asignx)
chr pos p.val.Retina
1: chr1 10177 0.381708
2: chr1 10352 0.959523
3: chr1 11008 0.218132
4: chr1 11012 0.218132
5: chr1 13110 0.998262
6: chr1 13116 0.438572
> head(l4x)
chr pos pval_nominal
1: chr1 13550 0.375614
2: chr1 14671 0.474708
3: chr1 14677 0.699887
4: chr1 16841 0.127895
5: chr1 16856 0.627822
6: chr1 17005 0.802803
但当我尝试合并这些时:
> m=merge(l4x,asignx, by = c("chr", "pos"),all.x=FALSE,all.y=FALSE)
Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in more than 2^31 rows (internal vecseq reached physical limit)
假设您的两个数据帧都加载到一个数据库中(您必须设置一个类似postgres或sql server的数据库(,则sql等效于:
m=merge(res, asign, by = c("chr", "pos"))
是否
select *
into m_table
from res
join asign
on res.chr = asign.chr
and res.pos = asign.pos
然后你会有一个新的表格:
select *
from m_table;