I have the following input:
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321]
VRKAERDSPCKRRGADRSFP
KSARLISSKAFRDVFAESITNSDPFFVVR
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA*
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321]
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL*
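I want to:
1. remove the line breaks inside the sequences (the lines that do not start with >)
2. remove the asterisks
3. change the FASTA headers
I can do 1. and 2.: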
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }'
sed "s/*//g"
I can also append a sequential number to the end of each header line:
awk '/^>/{$0=$0"_"(++i)}1'
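However, I am stuck on the last step: rewriting the header (replacing/removing parts of it) and adding a sequential number.
Desired output: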
>TM0001|hypothetical_protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein_of_unknown_function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease_P_protein_component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
Based on your "desired" output, a gawk solution:
awk 'BEGIN{ RS=">"; FS="[|\\]\\[]" } !$0{ next }
{ gsub(/^ */,"",$3); gsub(/[*[:space:]]/,"",$5); printf(">TM%04d|%s\n%s\n",++c,$3,$5)
}' yourfile
Output:
>TM0001|hypothetical protein
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein of unknown function
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease P protein component
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
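Details:
RS=">" - treat > as the record separator, so each FASTA entry (header plus sequence lines) is one record
FS="[|\\]\\[]" - field separator: any one of the characters |, [ or ]
!$0{ next } - skip empty records (the text before the first > is empty)
gsub(/^ */,"",$3) - remove leading spaces from the third field (the description)
gsub(/[*[:space:]]/,"",$5) - remove asterisks (*) and whitespace characters, including the line breaks, from the fifth field (the sequence)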
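The output shown earlier keeps the spaces in the description, whereas the desired output in the question uses underscores. A minimal sketch of a line-by-line variant that also handles that, assuming every header has the form id|ID:number| description [organism]; the file name sample.faa and the function name reformat_fasta are made up for illustration, and this is one possible approach rather than the original answer's method:

```shell
# Hypothetical sample file (two records taken from the question),
# created here only so the sketch is self-contained.
cat > sample.faa <<'EOF'
>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321]
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL*
EOF

# reformat_fasta FILE: hypothetical helper name.
reformat_fasta() {
  awk '
    /^>/ {
      if (seq != "") print seq        # flush the previous sequence
      seq = ""
      split($0, part, "[|]")          # part[3] is " description [organism]"
      desc = part[3]
      sub(/ *\[.*$/, "", desc)        # drop the trailing [organism] part
      gsub(/^ +| +$/, "", desc)       # trim spaces at both ends
      gsub(/ +/, "_", desc)           # spaces inside the description -> _
      printf ">TM%04d|%s\n", ++c, desc
      next
    }
    {
      gsub(/[* \t\r]/, "")            # remove asterisks and stray whitespace
      seq = seq $0                    # join the wrapped sequence lines
    }
    END { if (seq != "") print seq }  # flush the last sequence
  ' "$1"
}

out=$(reformat_fasta sample.faa)
printf '%s\n' "$out"
```

Because it processes the file one line at a time and does not rely on a multi-character-aware RS, this version should also run under awk implementations other than gawk. The counter simply numbers records in input order, so the TM numbers are independent of the original Thimo_ identifiers.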