A faster way to apply point mutations to strings

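I have a csv file whose rows are a bunch of mutations at different positions. I have a tuple of 24 strings, one per chromosome, to which I apply these mutations sequentially. The program is not as fast as I expected, and I'm not sure whether it is the string concatenation or the rebuilding of the tuple that is slowing the code down. Is there an easier way to apply these point mutations (or clustered groups of point mutations)? Here is the current code: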

def applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, chromosome_tuple):
    actual_st_idx = st_idx-1 #it is one indexed
    actual_ed_idx = ed_idx #This is because python string indexing doesnt include the last one
    if(chromosome == 'X'):
        tup_idx = 22
    elif(chromosome == 'Y'):
        tup_idx = 23
    else:
        tup_idx = int(chromosome)-1
    chrom_string = chromosome_tuple[tup_idx]
    start_string = chrom_string[:actual_st_idx]
    continue_string = chrom_string[actual_ed_idx:]
    if(mutation_to_string == '-'):
        final_string = start_string + continue_string
    else:
        final_string = start_string + mutation_to_string+ continue_string
    final_tuple = chromosome_tuple[:tup_idx] + (final_string,) + chromosome_tuple[tup_idx+1:]
    return final_tuple

The code that loops over the dataframe:

from tqdm import tqdm

def mutateWithDataframe(df, r):
    for index, row in tqdm(df.iterrows(), total = df.shape[0]):
        chromosome = str(row['chromosome'])
        st_idx = int(row['chromosome_start'])
        ed_idx = int(row['chromosome_end'])
        mutation_from_string = str(row['mutated_from_allele'])
        mutation_to_string = str(row['mutated_to_allele'])
        r = applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, r)
    return r

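For every mutation, you are completely rewriting the tuple of all 24 chromosomes. Wouldn't it make more sense to save only the chromosome that is changing? Use a mutable list instead of an immutable tuple; then you can simply write to the corresponding index of the original list. Demo below.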

def applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, chromosome_li):
    actual_st_idx = st_idx-1 #it is one indexed
    actual_ed_idx = ed_idx #This is because python string indexing doesnt include the last one
    if(chromosome == 'X'):
        _idx = 22
    elif(chromosome == 'Y'):
        _idx = 23
    else:
        _idx = int(chromosome)-1
    chrom_string = chromosome_li[_idx]
    start_string = chrom_string[:actual_st_idx]
    continue_string = chrom_string[actual_ed_idx:]
    if(mutation_to_string == '-'):
        chromosome_li[_idx] = start_string + continue_string
    else:
        chromosome_li[_idx] = start_string + mutation_to_string+ continue_string

df is the dataframe containing the mutation information.

r is a list containing the chromosomes.

from tqdm import tqdm

def mutateWithDataframe(df, r):
    for index, row in tqdm(df.iterrows(), total = df.shape[0]):
        chromosome = str(row['chromosome'])
        st_idx = int(row['chromosome_start'])
        ed_idx = int(row['chromosome_end'])
        mutation_from_string = str(row['mutated_from_allele'])
        mutation_to_string = str(row['mutated_to_allele'])
        # r is modified in place, so the return value is not reassigned here
        applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, r)
    return r

This should give the same output, but with noticeably better performance.
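
If the string slicing itself turns out to be the bottleneck (each mutation above still builds a brand-new copy of the whole chromosome string), one option that may be worth profiling is to hold each chromosome as a bytearray and assign to the slice in place. This is only a minimal sketch, not a benchmarked solution. It assumes the sequences are plain ASCII, and the names chrom_index and apply_mutation_inplace, as well as the toy data, are made up for illustration.

```python
# Hypothetical sketch: in-place mutation on bytearrays (assumes ASCII sequences).

def chrom_index(chromosome):
    """Map '1'..'22', 'X', 'Y' to list positions 0..23 (same scheme as above)."""
    if chromosome == 'X':
        return 22
    if chromosome == 'Y':
        return 23
    return int(chromosome) - 1


def apply_mutation_inplace(chromosome, st_idx, ed_idx, mutation_to_string, chromosome_li):
    """Replace the 1-indexed, inclusive range [st_idx, ed_idx] in place.

    chromosome_li is a list of bytearray objects; '-' means deletion.
    """
    replacement = b'' if mutation_to_string == '-' else mutation_to_string.encode('ascii')
    # Slice assignment on a bytearray modifies the existing object
    # instead of concatenating two slices into a new string.
    chromosome_li[chrom_index(chromosome)][st_idx - 1:ed_idx] = replacement


# Toy data: 24 short "chromosomes" instead of real sequences.
chromosomes = [bytearray(b'ACGTACGT') for _ in range(24)]
apply_mutation_inplace('1', 2, 2, 'T', chromosomes)    # substitution: C -> T
apply_mutation_inplace('X', 3, 4, '-', chromosomes)    # deletion of "GT"
apply_mutation_inplace('Y', 1, 1, 'GG', chromosomes)   # replace 1 base with 2

# Convert back to str once, after all mutations have been applied.
result = [c.decode('ascii') for c in chromosomes]
```

Because the mutations are still applied one after another, the coordinates behave exactly as in the loop above. Same-length substitutions only touch the affected bytes; insertions and deletions still have to shift the tail of the sequence, so how much this helps depends on the mix of mutations in the csv.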
