查找存在两次的模式,并允许每个模式出现 <=2 个不匹配



我有一个400,000读取的fastq文件(所以速度很重要)。在序列中有条形码集成,应该出现两次。给定一个条形码,我想找到有条形码出现两次与<= 2不匹配的序列。因此,使用条形码'ATTCGACCGATAGG',我想检索以下所有序列-

TATCTTGTGGAAAGGACGAAACACCGAACACAAAGCATAGATGCGTTTAAGAGCTATGCTGGAAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAATATCTTGTGGAAAGGACGAAACACCGGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAATATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACCGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAATATCTTGTGGAAAGGACGAAACACCGAGTCCGAGCAGAAGAAGAAGTTTAAGAGCTATGCTGGAAACAGCATAGCAAGTTTAAATAAGGCTAGTCCGTTATCAACTTGAAAAAGTGGCACCGAGTCGGTGCTTTTTTTATTCGACGATAGGGGTGGCAGGGGAGGCCGAGGAGGAAGAAGGGGAGGTGGCAGATTCGACCGATAGGTGGCGTAACTAGATCTTGAGACAAA

请注意,第四个序列中的第一个条形码缺少一个字符。我已经尝试过生物蟒蛇和正则表达式,但它太慢了,因为我有5k条形码。我想知道是否有一个快速的解决方案可用在python或像grep, awk或其他任何东西。谢谢。

使用GNU awk:

awk '{ for (i=1;i<=NF;i++) { fnd=0;subs=$i;while (match(subs,"ATTCGACCGATAGG")) { subs=substr(subs,RSTART+RLENGTH);if (RSTART>0) { fnd++;print fnd } } if (fnd <=2) { print $i } } }' file

解释:

awk '{ for (i=1;i<=NF;i++) {                           # Loop on each space delimited field
fnd=0;                                         # Initialise fnd variable/counter
subs=$i;                                       # Initialise substring variable
while (match(subs,"ATTCGACCGATAGG")) { 
subs=substr(subs,RSTART+RLENGTH);            # Check for multiple matches of "ATTCGACCGATAGG" in subs.
if (RSTART>0) { 
fnd++;                                    # Increment fnd if string found in subs
} 
} 
if (fnd <=2) { 
print $i                                    # If found twice or less than twice print the field
}
} 
}' file

相关内容

  • 没有找到相关文章

最新更新