Replace the string between two patterns



I would like to replace (using sed/awk/tr) everything between CleanAgrobacterium and _gene with ZZZ in my file A.nwk:

(((CleanAgrobacterium_fabrum_str__C58_DE0068_Scaffold_Proteins_gene-FS783_RS12830:0,CleanAgrobacterium_fabrum_str__C58_DE0067_Scaffold_Proteins_gene-FS653_RS12825:0):0.056789,(CleanAgrobacterium_fabrum_GV2260_Complete_Genome_Proteins_gene-EML4058_RS17445:0,(CleanAgrobacterium_fabrum_1D1416_Chromosome_Proteins_gene-NQG32_RS17500:0,(CleanAgrobacterium_fabrum_PDC82_Contig_Proteins_gene-BLT49_RS14090:0,(CleanAgrobacterium_fabrum_N3394_Scaffold_Proteins_gene-G6L76_RS17395:0,(CleanAgrobacterium_fabrum_12D13_Complete_Genome_Proteins_gene-At12D13_RS18010:0,(CleanAgrobacterium_fabrum_Bi46_Contig_Proteins_gene-LQ162_RS02700:0,(CleanAgrobacterium_fabrum_ARqua1_Scaffold_Proteins_gene-HI842_RS18310:0,(CleanAgrobacterium_fabrum_N4094_Scaffold_Proteins_gene-G6L42_RS17400:0,(CleanAgrobacterium_fabrum_GV3101__pMP90_Complete_Genome_Proteins_gene-EML485_RS17435:0,(CleanAgrobacterium_fabrum_Kin001_Complete_Genome_Proteins_gene-FY134_RS17775:0,(CleanAgrobacterium_fabrum_LBA645_Complete_Genome_Proteins_gene-KXJ62_RS17445:0,(CleanAgrobacterium_fabrum_Di1525a_Scaffold_Proteins_gene-G6L89_RS17735:0,(CleanAgrobacterium_fabrum_NFIX02_Scaffold_Proteins_gene-BLR22_RS16795:0,(CleanAgrobacterium_fabrum_Arqua_Contig_Proteins_gene-EXN51_RS19140:0,(CleanAgrobacterium_fabrum_str__J-07_J-07_Scaffold_Proteins_gene-AGR8A_RS20015:0,CleanAgrobacterium_fabrum_1D132_Complete_Genome_Proteins_gene-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(CleanAgrobacterium_fabrum_EHA105_Complete_Genome_Proteins_gene-EML540_RS17455:0,(CleanAgrobacterium_fabrum_RIT-As-3_Contig_Proteins_gene-ORG40_RS11815:0,(CleanAgrobacterium_fabrum_2788_Contig_Proteins_gene-G6L39_RS17590:0,(CleanAgrobacterium_fabrum_BG5_Complete_Genome_Proteins_gene-F3P66_RS17495:0,(CleanAgrobacterium_fabrum_Bi05_Contig_Proteins_gene-LQV40_RS07170:0,(CleanAgrobacterium_fabrum_str__C58_C58_Complete_Genome_Proteins_gene-ATU_RS17440:0,CleanAgrobacterium_fabrum_NFIX01_Scaffold_Proteins_gene-BMY00_RS16800:0):0):0):0):0):0):0);
My attempt:

sed "/CleanAgrobacterium/,/gene-/d" A.nwk

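Instead of using a range, you can make the pattern more specific for the example data: [[:alnum:]_-]+ matches one or more alphanumeric characters, - or _ in between, and the whole match is replaced with zzz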

sed -E "s/CleanAgrobacterium[[:alnum:]_-]+_gene/zzz/g" A.nwk

Output

(((zzz-FS783_RS12830:0,zzz-FS653_RS12825:0):0.056789,(zzz-EML4058_RS17445:0,(zzz-NQG32_RS17500:0,(zzz-BLT49_RS14090:0,(zzz-G6L76_RS17395:0,(zzz-At12D13_RS18010:0,(zzz-LQ162_RS02700:0,(zzz-HI842_RS18310:0,(zzz-G6L42_RS17400:0,(zzz-EML485_RS17435:0,(zzz-FY134_RS17775:0,(zzz-KXJ62_RS17445:0,(zzz-G6L89_RS17735:0,(zzz-BLR22_RS16795:0,(zzz-EXN51_RS19140:0,(zzz-AGR8A_RS20015:0,zzz-At1D132_RS18580:0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0):0,(zzz-EML540_RS17455:0,(zzz-ORG40_RS11815:0,(zzz-G6L39_RS17590:0,(zzz-F3P66_RS17495:0,(zzz-LQV40_RS07170:0,(zzz-ATU_RS17440:0,zzz-BMY00_RS16800:0):0):0):0):0):0):0);

This replaces all text between CleanAgrobacterium and _gene with ZZZ:

sed -E 's/(CleanAgrobacterium).*(_gene)/\1ZZZ\2/g' A.nwk

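But the result is probably not what you expect, because .* is greedy and runs up to the last _gene on the line. I assume you want the text in between to be matched non-greedily. For that, use perl: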

perl -pe 's/(CleanAgrobacterium).*?(_gene)/${1}ZZZ${2}/g' A.nwk

This might work for you (GNU sed):

sed -E 's/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g' file

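Append a newline after each CleanAgrobacterium and prepend a newline before each gene-.

Replace everything between the two words that is not a newline.

Remove all the introduced newlines.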
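The three steps can be watched on a shortened, made-up sample. This is only a sketch: it assumes GNU sed (for \n in the replacement and inside a bracket expression), and the variable names are hypothetical.

```shell
# Hypothetical two-label sample, shortened from the data in the question.
sample='(CleanAgrobacterium_fabrum_A_Proteins_gene-X1:0,CleanAgrobacterium_fabrum_B_Proteins_gene-Y2:0);'

# Marking steps only: shows where the temporary newlines land.
marked=$(printf '%s\n' "$sample" | sed -E 's/CleanAgrobacterium/&\n/g; s/gene-/\n&/g')

# Full script: mark, replace what sits between two markers, remove markers.
result=$(printf '%s\n' "$sample" | sed -E 's/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g')

printf '%s\n---\n%s\n' "$marked" "$result"
```

Because the closing marker is gene- rather than _gene, the underscore in front of gene is part of the replaced text, so each label becomes CleanAgrobacteriumZZZgene-….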

N.B. This does not cater for matches that span separate lines. In that case, use something like:

sed -E 'H;1h;$!d;x
s/\n/@@@NEWLINE%%%/g
s/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g
s/@@@NEWLINE%%%/\n/g' file

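This slurps the whole file into memory, replaces every newline with a unique string, applies the first solution, and then tidies up by turning the unique string back into newlines.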
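A minimal sketch of the multi-line case, assuming GNU sed and a made-up label that is broken across two lines (the input and the variable name are hypothetical):

```shell
# Hypothetical input: one label split over two lines.
joined=$(printf '%s\n' '(CleanAgrobacterium_fabrum_A_' 'Proteins_gene-X1:0);' |
sed -E 'H;1h;$!d;x
s/\n/@@@NEWLINE%%%/g
s/CleanAgrobacterium/&\n/g
s/gene-/\n&/g
s/(CleanAgrobacterium)\n[^\n]*\n(gene-)/\1ZZZ\2/g
s/\n//g
s/@@@NEWLINE%%%/\n/g')

printf '%s\n' "$joined"
```

Here the original line break sat inside the replaced region, so it disappears together with the replaced text and the result is the single line (CleanAgrobacteriumZZZgene-X1:0);. Line breaks outside a match are restored by the last command.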

Try this:

sed 's/gene-/gene-\n/g' < A.nwk | sed 's/CleanAgrobacterium.*gene-/CleanAgrobacteriumZZZgene-/g' | sed -n ':a;N;$!ba;s/\n//g;p' > output.txt

Used on Linux with GNU sed 4.9.

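Another sed solution. Put in more readable terms, it replaces every THIS between START and END in "fooSTARTTHISENDfooSTARTTHISENDfoo" with THAT and outputs "fooSTARTTHATENDfooSTARTTHATENDfoo"; in practice, apply it to your own sample.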

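$ sed -E 's/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?_gene/\1ZZZ_gene/g' file

This solution is non-greedy. It relies on the capture group (CleanAgrobacterium) and its back-reference \1, on the literal _gene that closes the match, and on the fact that whatever lies between them,
([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?
(any text that does not contain _gene), is replaced by ZZZ. The closing _gene is written literally in the replacement because, with all the nested groups in front of it, it would be group 15, and back-references only go from \1 to \9. The same expression can be used in, for example, GNU awk, whose gensub() supports back-references:

$ gawk '{print gensub(/(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?_gene/,"\\1ZZZ_gene","g",$0)}' file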
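One way to check that the long expression really stops at the first _gene is to feed it a label containing a near miss such as _gen_gene. The sketch below assumes a sed that accepts -E; the sample string and the variable names are made up, and the replacement spells out _gene literally rather than using a second back-reference.

```shell
# Middle part matches any text that does not contain "_gene".
re='(CleanAgrobacterium)([^_]|_(_|g(_|e(_|n_)))*([^_g]|g([^_e]|e([^_n]|n[^_e]))))*(_(_|g(_|e(_|n_)))*(g(e?|en))?)?_gene'

# Hypothetical sample: the first label contains the near miss "_gen_gene".
sample='CleanAgrobacterium_a_gen_gene-X:0,CleanAgrobacterium_b_gene-Y:0'

# \1 is (CleanAgrobacterium); the closing _gene is written literally.
out=$(printf '%s\n' "$sample" | sed -E "s/$re/\1ZZZ_gene/g")

printf '%s\n' "$out"
```

Both labels survive, and each one is shortened to CleanAgrobacteriumZZZ_gene followed by its own suffix.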
