Remove strings and add sequential numbering using awk or sed



I have the following input:

>Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]
LIAPTMILRIRLTEFCPMRTEGFEE
TGIGPLDSRMPRYDDVVHHREIIT
YPPEALSNDPFDPTSIDGSPSAFF*
>ThimoAM_0002|ID:40707134| protein of unknown function [Thioflavicoccus mobilis 8321]
VRKAERDSPCKRRGADRSFP
KSARLISSKAFRDVFAESITNSDPFFVVR
ARPNLAETARLGIAVSKKCARRSVDRSRIKRII
RESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA*
>Thimo_0002|ID:40710524| ribonuclease P protein component [Thioflavicoccus mobilis 8321]
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRAR
TTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAP
RRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL*

I want to:

  1. remove the line breaks in lines that do not start with >
  2. remove the asterisks
  3. change the FASTA headers

I can do 1 and 2:

awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' 
sed 's/\*//g' 
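The two commands above can also be folded into a single awk pass. A minimal sketch, assuming a POSIX-style awk; `unwrap_fasta` is a hypothetical helper name, and the input in the last line is toy data rather than the real file:

```shell
# Hypothetical helper: join wrapped sequence lines and drop "*" in one pass.
unwrap_fasta() {
    awk '
        # Header line: flush the sequence collected so far, then print the header.
        /^>/ { if (seq != "") print seq; seq = ""; print; next }
        # Sequence line: strip asterisks and append to the current sequence.
        { gsub(/\*/, ""); seq = seq $0 }
        # End of input: flush the last sequence.
        END { if (seq != "") print seq }
    ' "$@"
}

# Toy input, for illustration only.
printf '>a\nAB\nCD*\n>b\nEF*\n' | unwrap_fasta
```

Buffering the sequence in a variable avoids the trick of printing a delayed newline, at the cost of holding one full sequence in memory at a time.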

I can also append a sequential number to the end of the header lines:

awk '/^>/{$0=$0"_"(++i)}1'

but I fail at the last step: replacing/deleting parts of the header and adding a sequential number.

Desired output:

>TM0001|hypothetical_protein  
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein_of_unknown_function  
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease_P_protein_component  
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL

Based on your "desired" output, a gawk solution:

awk 'BEGIN{ RS=">"; FS="[|\]\[]" }!$0{ next }
     { gsub(/^ */,"",$3); gsub(/[*[:space:]]/,"",$5); printf(">TM%04d|%s\n%s\n",++c,$3,$5) 
}' yourfile

Output:

>TM0001|hypothetical protein 
LIAPTMILRIRLTEFCPMRTEGFEETGIGPLDSRMPRYDDVVHHREIITYPPEALSNDPFDPTSIDGSPSAFF
>TM0002|protein of unknown function 
VRKAERDSPCKRRGADRSFPKSARLISSKAFRDVFAESITNSDPFFVVRARPNLAETARLGIAVSKKCARRSVDRSRIKRIIRESFRWVRNDLPVMDYVVIARHAAVKRTNPRLFESLRSHWTKFSEPDA
>TM0003|ribonuclease P protein component 
MILLIRLRSTDRRAHFFDTAIPNLAVSARLGRARTTKNGSEFVMDSAKTSRNAFEEISLADFGKERSAPRRLQGESLSAFRTTRGQDEPATFRCPTRPKPMCMRAL
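The output shown keeps the spaces in the description, whereas the desired output in the question uses underscores. A minimal sketch of one way to close that gap, written as a line-oriented variant rather than the record-separator approach; `rename_fasta` is a hypothetical helper name, and the sketch assumes every header has the form `name|ID:...| description [organism]`:

```shell
# Hypothetical helper: renumber headers as TM0001, TM0002, ... and
# turn the description into an underscore-joined label.
rename_fasta() {
    awk '
        /^>/ {
            if (c > 0) print seq          # flush the previous sequence
            seq = ""
            split($0, part, "|")          # part[3] holds " description [organism]"
            desc = part[3]
            sub(/ *\[.*$/, "", desc)      # drop the bracketed organism name
            sub(/^ +/, "", desc)          # drop leading spaces
            gsub(/ /, "_", desc)          # spaces become underscores
            printf(">TM%04d|%s\n", ++c, desc)
            next
        }
        { gsub(/\*/, ""); seq = seq $0 }  # sequence line: strip "*" and join
        END { if (c > 0) print seq }      # flush the last sequence
    ' "$@"
}

# Usage, assuming the input is stored in a file named yourfile:
# rename_fasta yourfile
```

Unlike the output above, this variant leaves no trailing space after the description, because everything from the space before `[` onward is removed.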

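Details:

  • RS=">": treat > as the record separator

  • FS="[|\]\[]": field separator, any one of the characters |, [ and ]

  • !$0{ next }: skip empty records

  • gsub(/^ */,"",$3): remove leading spaces from the third field

  • gsub(/[*[:space:]]/,"",$5): remove asterisks (*) and whitespace characters from the fifth field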
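To see why the description ends up in the third field, it can help to print how one header line splits on |, [ and ]. A small sketch, assuming a POSIX-style awk; `show_fields` is a hypothetical helper name, and the separator is written here as an alternation instead of a bracket expression to sidestep differences in how awk implementations treat backslashes inside brackets:

```shell
# Hypothetical helper: show how a line splits on "|", "[" and "]".
show_fields() {
    awk '{
        n = split($0, f, /\||\[|\]/)      # same three separator characters as FS
        for (i = 1; i <= n; i++)
            printf("%d: <%s>\n", i, f[i]) # angle brackets make stray spaces visible
    }'
}

printf '%s\n' 'Thimo_0001|ID:40710520| hypothetical protein [Thioflavicoccus mobilis 8321]' | show_fields
```

On a single header line the last field is empty because nothing follows the closing bracket. In the full solution the record runs up to the next >, so that field holds the sequence lines instead, which is why the whitespace and asterisk cleanup is applied to $5.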
