如何在C++中滑动窗口读取多行fasta文件

我有一个C++代码，它读取一个fasta文件(单行读取(，并在读取上滑动一个l窗口以生成kmer。像这样：

>NC_000913.3-2320800
GACGAAGCTGCCGACCAGCGTTTTATGTCAGTGCGGACATCAGCACTGCGTGAACAATTTGCCTGGCTGCGCGAGAACGGTTATCAACCGGTCAGTATTG

现在我有另一个fasta文件，它读取了不止一行，如下所示：

>LR2957 4054ae1b-13f5-89b1-e7b5-d64cabb8badb NC_000913.3,+strand,4451319-4510860 length=55312 error-free_length=59451 read_identity=74.36%
TGTAGTCCGTCAAGTTACGTTATTGCTACGTCTATCAGGGAAGTCAACCTGCCTGCAATA
TGGTAGATAAATCCTATTATGCCGCGAGACAACCCTTGGCTTCCTACACGCGCAGTGGAG

此功能是：

void Sliding_window_l (const char *ptr, size_t length) 
{   
size_t p=0;
/*find start of a read*/
for(; ptr[p]!='>' && p<length; p++) {/*noop*/}
kmer_t kmer = 0;
while(p<length) {
//printf("hello second %cn", ptr[p]);  
assert(ptr[p]=='>'); /*this will be true*/
/*skip till newline*/
for(; p<length && ptr[p]!='n'; p++) {/*noop*/ }
p++; /*skip the newline*/
if(p+LMER_LENGTH > length) break; /*too short a read*/
kmer = 0;
int i;
for(i=0; ptr[p]!='n' && i<LMER_LENGTH-1; i++) {
kmer = lmer_shift(kmer, char_to_el(ptr[p++]));
//kmer = kmer_cons(kmer, i, char_to_el(ptr[p++]));
}
while(p<length && ptr[p]!='n') {
kmer = lmer_shift(kmer, char_to_el(ptr[p++]));
lmer_frequency[kmer]++;
}
p++; /*skip the newline*/
}
}

该函数对于单行读取运行良好，但对于多行读取，我会得到这样的错误：

void Sliding_window_l(const char*, size_t): Assertion `ptr[p]=='>'' failed.

有人能指导我如何修改这个函数，使它也能为多行读取运行吗。非常感谢。

查看您的代码：

while(p < length) {
// The first time the assertion is true
assert(ptr[p]=='>');
for(; p<length && ptr[p]!='n'; p++) {/*noop*/ }
p++;
// Now p points at the beginning of the first line of payload
// more code...
while(p < length && ptr[p] != 'n') {
//...
}
// You've finished the first line
p++; /*skip the newline*/
// Now p may point to the beginning of another line of the payload
// The next iteration would start with p pointing NOT to the '>'
}

你的意思是断言应该在循环之前完成吗？

不管怎样，如果我正确理解了你的代码，你就会丢失与上一行末尾和下一行开头重叠的kmer。

指南：

简化你的生活。逐行读取，将有效载荷连接到一大串核苷酸。每当您找到输入的末尾(或找到下一个以">"开头的注释(时，您都可以开始处理串联读取
避免const char*，更喜欢std::string：你在用C++，不是吗
重新格式化代码以使其更可读：糟糕的格式是邪恶的根源

相关内容

最新更新

热门标签：