I have a tab-delimited file (call it file1) that looks like this:
NC_027300.1 Gnomon exon 5501 5691 . - . gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon exon 16966 17019 . - . gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon exon 23978 24241 . - . gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon exon 43486 43714 . - . gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon exon 61647 62139 . - . gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon CDS 5501 5691 . - 2 gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon CDS 16966 17019 . - 2 gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon CDS 23978 24241 . - 2 gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon CDS 43486 43633 . - 0 gene_id "1"; transcript_id "1.1";
NC_027300.1 Gnomon exon 160437 160638 . - . gene_id "2"; transcript_id "2.1";
NC_027300.1 Gnomon exon 160913 161019 . - . gene_id "2"; transcript_id "2.1";
and a larger tab-delimited file (file2) that looks like this:
NC_027300.1 Gnomon gene 5501 62139 . - . ID=gene0;Dbxref=GeneID:106560212;Name=LOC106560212;gbkey=Gene;gene=LOC106560212;gene_biotype=protein_coding
NC_027300.1 Gnomon mRNA 5501 62139 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;Name=XM_014160784.1;gbkey=mRNA;gene=LOC106560212;model_evidence=Supporting evidence includes similarity to: 99%25 coverage of the annotated genomic feature by RNAseq alignments%2C including 8 samples with support for all annotated introns;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon exon 61647 62139 . - . ID=id1;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon exon 43486 43714 . - . ID=id2;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon exon 23978 24241 . - . ID=id3;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon exon 16966 17019 . - . ID=id4;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon exon 5501 5691 . - . ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
NC_027300.1 Gnomon CDS 43486 43633 . - 0 ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
NC_027300.1 Gnomon CDS 23978 24241 . - 2 ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
NC_027300.1 Gnomon CDS 16966 17019 . - 2 ID=cds0;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XP_014016259.1;Name=XP_014016259.1;gbkey=CDS;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;protein_id=XP_014016259.1
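I want to create a new file that contains only the lines present in both files (matched on the first 8 columns), made up of all 9 columns from file1 plus the 9th column of file2 as column 10: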
NC_027300.1 Gnomon exon 5501 5691 . - . gene_id "1"; transcript_id "1.1"; ID=id5;Parent=rna0;Dbxref=GeneID:106560212,Genbank:XM_014160784.1;gbkey=mRNA;gene=LOC106560212;product=fibroblast growth factor receptor 3-like;transcript_id=XM_014160784.1
I have been trying to follow this example, and this is what (with my very limited knowledge) I came up with:
awk 'NR==FNR{a[$1,$2,$3,$4,$5,$6,$7,$8]=$10;next} ($1,$2,$3,$4,$5,$6,$7,$8) in a{print $0, a[$$1,$2,$3,$4,$5,$6,$7,$8]}' file1 file2 > newfile
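Can someone tell me whether I am anywhere close, and help? My files are 1M lines each and the command is currently running, but I am worried it may take a while before I can see whether it worked! Thanks in advance.
Swap the order of the input files, so that file2 is read first and its column 9 is stored, and tidy up: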
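awk '
BEGIN { FS=OFS="\t" }                                  # tab-delimited input and output
{ k = $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 }   # key: the first 8 columns
NR==FNR { a[k]=$9; next }                              # first file (file2): store column 9
k in a { print $0, a[k] }                              # second file (file1): append it on a match
' file2 file1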
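To check this without waiting on the full 1M-line files, here is a minimal sketch that runs the same awk program on two tiny sample files. The temporary directory, the `chr1` sequence name, and the shortened attribute strings are made up for illustration; only the awk program itself comes from the answer above.

```shell
#!/bin/sh
# Hypothetical miniature inputs, written to a temp dir so nothing real is touched.
tmp=$(mktemp -d)

# file1: two records; only the first has a matching line in file2.
printf 'chr1\tGnomon\texon\t5501\t5691\t.\t-\t.\tgene_id "1";\n' >  "$tmp/file1"
printf 'chr1\tGnomon\texon\t9000\t9100\t.\t-\t.\tgene_id "9";\n' >> "$tmp/file1"

# file2: one record sharing its first 8 columns with the first file1 record.
printf 'chr1\tGnomon\texon\t5501\t5691\t.\t-\t.\tID=id5;Parent=rna0\n' > "$tmp/file2"

awk '
BEGIN   { FS = OFS = "\t" }
        { k = $1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 }
NR==FNR { a[k] = $9; next }      # reading file2: remember its column 9
k in a  { print $0, a[k] }       # reading file1: print the line plus the stored value
' "$tmp/file2" "$tmp/file1" > "$tmp/newfile"

cat "$tmp/newfile"
```

Only the first file1 record should come out, with `ID=id5;Parent=rna0` appended as a tenth tab-separated column. On the real data, piping the first few thousand lines of each file through `head` is another quick way to sanity-check before the full run.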
You are on the right track; it looks like you only need a small correction.
Change
a[$$1,$2,$3,$4,$5,$6,$7,$8]
   ^
   here (extra $)
to
a[$1,$2,$3,$4,$5,$6,$7,$8]
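With that fix, the command prints each file2 line followed by the value stored in array a (the 10th field of file1) whenever the key built from the first 8 fields of file2 exists in a; those keys were created from the first 8 fields of file1.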
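To see why the stray `$` matters, here is a small sketch with made-up input lines. In awk, `$$1` means "the field whose number is the value of `$1`", so it is not the same as `$1`.

```shell
#!/bin/sh
# $1 is "2", so $$1 is $2.
out1=$(printf '2 x y\n' | awk '{ print $$1 }')
echo "$out1"    # prints: x

# $1 is "NC" (non-numeric, so it counts as 0), which makes $$1 the same as $0,
# the whole line.
out2=$(printf 'NC x y\n' | awk '{ print $$1 }')
echo "$out2"    # prints: NC x y
```

With a sequence name such as NC_027300.1 in column 1, the lookup key in the original command would therefore start with the entire line rather than with column 1, so it would not match the keys that were stored.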