Merging tab-delimited files based on the first 8 columns



I have a tab-delimited file (let's call it file1) that looks like this:

NC_027300.1 Gnomon  exon    5501    5691    .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    16966   17019   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    23978   24241   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    43486   43714   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    61647   62139   .   -   .   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 5501    5691    .   -   2   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 16966   17019   .   -   2   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 23978   24241   .   -   2   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  CDS 43486   43633   .   -   0   gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon  exon    160437  160638  .   -   .   gene_id "2"; transcript_id "2.1";
NC_027300.1 Gnomon  exon    160913  161019  .   -   .   gene_id "2"; transcript_id "2.1";

and a larger tab-delimited file (file2) that looks like this:

NC_027300.1 Gnomon  gene    5501    62139   .   -   .   ID=gene0;Dbxref=GeneID:106560212;Name=LOC106560212;gbkey=Gene;gene=LOC106560212;gene_biotype=protein_coding
NC_027300.1 Gnomon  mRNA    5501    62139   .   -   .   ID=rna0;Parent=gene0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;Name=XM_014160784.1;gbkey=mRNA;gene=LOC106560212;model_evidence=Supporting evidence includes similarity to: 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    61647   62139   .   -   .   ID=id1;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    43486   43714   .   -   .   ID=id2;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    23978   24241   .   -   .   ID=id3;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    16966   17019   .   -   .   ID=id4;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  exon    5501    5691    .   -   .   ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon  CDS 43486   43633   .   -   0   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
NC_027300.1 Gnomon  CDS 23978   24241   .   -   2   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
NC_027300.1 Gnomon  CDS 16966   17019   .   -   2   ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1

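I want to create a new file that contains only the lines of file2 that are also present in file1, matched on the first 8 columns, with all 9 columns from file1 and column 9 of file2 appended as column 10.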

NC_027300.1 Gnomon  exon    5501    5691    .   -   .   gene_id "1"; transcript_id "1.1"; ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1

I have been trying to follow this example, and this is what (with my very limited knowledge) I came up with:

awk 'NR==FNR{a[$1,$2,$3,$4,$5,$6,$7,$8]=$10;next} ($1,$2,$3,$4,$5,$6,$7,$8) in a{print $0, a[$$1,$2,$3,$4,$5,$6,$7,$8]}' file1 file2 > newfile

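Can someone tell me whether I am anywhere close, and help if not? My files are 1M lines long and the command is currently running, but I fear it may be a while before I can see whether it worked! Thanks in advance.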

Switch the order of the input files and tidy up:

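awk '
BEGIN { FS=OFS="\t" }                                   # read and write tab-separated fields
{ k = $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 }    # key: the first 8 columns
NR==FNR { a[k]=$9; next }                               # first file (file2): remember column 9
k in a { print $0, a[k] }                               # second file (file1): append it as column 10
' file2 file1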
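
Before running this on 1M lines, it may help to try it on a tiny sample. The following is only a sketch: the temporary directory, the two-line inputs, and the shortened column-9 values are made up for illustration and are not the real data.

```shell
# Hypothetical miniature inputs written to a temporary directory.
tmp=$(mktemp -d)
printf 'NC_027300.1\tGnomon\texon\t5501\t5691\t.\t-\t.\tgene_id "1"; transcript_id "1.1";\n' > "$tmp/file1"
printf 'NC_027300.1\tGnomon\texon\t160437\t160638\t.\t-\t.\tgene_id "2"; transcript_id "2.1";\n' >> "$tmp/file1"
printf 'NC_027300.1\tGnomon\tgene\t5501\t62139\t.\t-\t.\tID=gene0\n' > "$tmp/file2"
printf 'NC_027300.1\tGnomon\texon\t5501\t5691\t.\t-\t.\tID=id5;Parent=rna0\n' >> "$tmp/file2"

# Same program as above: file2 is read first, file1 second.
result=$(awk '
BEGIN { FS=OFS="\t" }
{ k = $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 }
NR==FNR { a[k]=$9; next }
k in a { print $0, a[k] }
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Only the exon line at 5501-5691 should come out, with ID=id5;Parent=rna0 appended as a tab-separated 10th column; the second file1 line has no match in the sample file2 and is dropped.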

You are on the right path; it looks like you need a small correction.

Change

a[$$1,$2,$3,$4,$5,$6,$7,$8]
  ^
 Here

to

a[$1,$2,$3,$4,$5,$6,$7,$8]

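This prints the 10th field of file1 stored in array a whenever the index key built from the first 8 fields of file2 exists in array a, whose keys were created from the first 8 fields of file1.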
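
As for why the extra `$` matters: in awk, `$$1` means "the field whose number is the value of `$1`". In these files `$1` is a sequence name such as NC_027300.1, which evaluates to 0 as a number, so `$$1` becomes `$0` (the whole line) and the lookup key never matches what was stored. A small illustration, using made-up one-line inputs:

```shell
# $1 is 3, so $$1 is $3, the third field.
first=$(printf '3 b c\n' | awk '{ print $$1 }')

# $1 is not numeric, so it counts as 0 and $$1 is $0, the whole line.
second=$(printf 'NC_027300.1 b c\n' | awk '{ print $$1 }')

printf '%s\n%s\n' "$first" "$second"
```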
