Retrieve data from a file using a pattern and annotate it with the file name



I have a file named bin.001.fasta that looks like this:

>contig_655
GGCGGTTATTTAGTATCTGCCACTCAGCCTCGCTATTATGCGAAATTTGAGGGCAGGAGGAAACCATGAC
AGTAGTCAAGTGCGACAAGC
>contig_866
CCCAGACCTTTCAGTTGTTGGGTGGGGTGGGTGCTGACCGCTGGTGAGGGCTCGACGGCGCCCATCCTGG
CTAGTTGAAC
...

What I want is a new file in which the first column is the retrieved contig ID and the second column is the file name without the .fasta extension:

contig_655    bin.001
contig_866    bin.001

Any ideas?

Could you please try the following.

awk -F'>' '
FNR==1{
split(FILENAME,array,".")
file=array[1]"."array[2]
}
/^>/{
print $2,file
}
'  Input_file

Or, if your Input_file name contains more than 2 dots, run the following, which strips only the part after the last dot.

awk -F'>' '
FNR==1{
match(FILENAME,/.*\./)
file=substr(FILENAME,RSTART,RLENGTH-1)
}
/^>/{
print $2,file
}
'  Input_file
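
To see how the two approaches differ, here is a small sketch that applies both to a file name with three dots. The name sample.bin.001.fasta is hypothetical and used only for illustration. The first approach keeps the first two dot-separated parts, while the second uses a greedy match up to the last literal dot (`/.*\./`) and then drops that dot.

```shell
name="sample.bin.001.fasta"   # hypothetical file name with three dots

# Approach 1: keep only the first two dot-separated parts.
first_two=$(awk -v f="$name" 'BEGIN { split(f, a, "."); print a[1] "." a[2] }')

# Approach 2: greedy match up to the last dot, then drop that trailing dot.
up_to_last=$(awk -v f="$name" 'BEGIN { match(f, /.*\./); print substr(f, RSTART, RLENGTH-1) }')

printf '%s\n%s\n' "$first_two" "$up_to_last"
```

For a name like bin.001.fasta both approaches give bin.001, so the choice only matters when the part before the extension contains more than one dot.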

Explanation: a detailed explanation of the first command above.

awk -F'>' '                   ##Starting awk program from here and setting field separator as > here for all lines.
FNR==1{                       ##Checking condition if this is first line then do following.
split(FILENAME,array,".")   ##Splitting filename which is passed to this awk program into an array named array with delimiter .
file=array[1]"."array[2]    ##Creating variable file whose value is 1st and 2nd element of array with DOT in between as per OP shown sample.
}
/^>/{                         ##Checking condition if a line starts with > then do following.
print $2,file               ##Printing 2nd field and variable file value here.
}
' Input_file                  ##Mentioning Input_file name here.
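
If several bins need to be processed at once, the same idea can be wrapped in a small shell function. This is only a sketch under a few assumptions that are not in the original post: the function name contig_table is made up, the output is tab-separated, any leading directory is stripped from the file name, and the demo sequences are shortened versions of the sample data.

```shell
#!/bin/sh
# Sketch: build a "contig ID <TAB> bin name" table from one or more FASTA files.
# contig_table is a hypothetical helper name; tab-separated output is an assumption.
contig_table() {
  awk -F'>' -v OFS='\t' '
    FNR==1 {                       # first line of each input file
      file = FILENAME
      sub(/.*\//, "", file)        # drop any leading directory path
      sub(/\.[^.]*$/, "", file)    # drop the last extension (e.g. .fasta)
    }
    /^>/ { print $2, file }        # header line: contig ID, then bin name
  ' "$@"
}

# Demo with shortened sample data written to a temporary directory.
tmpdir=$(mktemp -d)
printf '>contig_655\nGGCGGTTATT\n>contig_866\nCCCAGACCTT\n' > "$tmpdir/bin.001.fasta"
result=$(contig_table "$tmpdir/bin.001.fasta")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because the FNR==1 block runs again at the start of every input file, a call such as contig_table bin.*.fasta > contigs.tsv would label each contig with the bin it came from.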
