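I have a CSV file whose rows are a series of mutations at different positions. I also have a tuple of 24 strings, one per chromosome, to which I apply these mutations sequentially. The program is not as fast as I expected, and I am not sure whether the slowdown comes from the string concatenation or from rebuilding the tuple. Is there a simpler way to apply these point mutations (or clustered groups of point mutations)? Here is the current code: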
def applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, chromosome_tuple):
    actual_st_idx = st_idx - 1  # input positions are 1-indexed
    actual_ed_idx = ed_idx  # unchanged, because Python slicing excludes the end index
    if chromosome == 'X':
        tup_idx = 22
    elif chromosome == 'Y':
        tup_idx = 23
    else:
        tup_idx = int(chromosome) - 1
    chrom_string = chromosome_tuple[tup_idx]
    start_string = chrom_string[:actual_st_idx]
    continue_string = chrom_string[actual_ed_idx:]
    if mutation_to_string == '-':
        # '-' marks a deletion: drop the span entirely
        final_string = start_string + continue_string
    else:
        final_string = start_string + mutation_to_string + continue_string
    # rebuild the whole tuple with the one changed chromosome
    final_tuple = chromosome_tuple[:tup_idx] + (final_string,) + chromosome_tuple[tup_idx+1:]
    return final_tuple
The code that loops over the DataFrame:
from tqdm import tqdm

def mutateWithDataframe(df, r):
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        chromosome = str(row['chromosome'])
        st_idx = int(row['chromosome_start'])
        ed_idx = int(row['chromosome_end'])
        mutation_from_string = str(row['mutated_from_allele'])
        mutation_to_string = str(row['mutated_to_allele'])
        # each call returns a brand-new tuple
        r = applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, r)
    return r
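For every single mutation, you rebuild the entire tuple of 24 chromosomes. Wouldn't it make more sense to store only the chromosome that is actually changing? Use a mutable list instead of an immutable tuple; then you can simply assign to the corresponding index of the original list. Demo below.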
def applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, chromosome_li):
    actual_st_idx = st_idx - 1  # input positions are 1-indexed
    actual_ed_idx = ed_idx  # unchanged, because Python slicing excludes the end index
    if chromosome == 'X':
        _idx = 22
    elif chromosome == 'Y':
        _idx = 23
    else:
        _idx = int(chromosome) - 1
    chrom_string = chromosome_li[_idx]
    start_string = chrom_string[:actual_st_idx]
    continue_string = chrom_string[actual_ed_idx:]
    # write the result back into the list in place; nothing is returned
    if mutation_to_string == '-':
        chromosome_li[_idx] = start_string + continue_string
    else:
        chromosome_li[_idx] = start_string + mutation_to_string + continue_string
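Here df is the DataFrame containing the mutation information, and r is a list containing the chromosomes (if you start from the original tuple, convert it first with list(...)).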
from tqdm import tqdm

def mutateWithDataframe(df, r):
    for index, row in tqdm(df.iterrows(), total=df.shape[0]):
        chromosome = str(row['chromosome'])
        st_idx = int(row['chromosome_start'])
        ed_idx = int(row['chromosome_end'])
        mutation_from_string = str(row['mutated_from_allele'])
        mutation_to_string = str(row['mutated_to_allele'])
        # r is modified in place, so there is no reassignment here
        applySimpleMutation(chromosome, st_idx, ed_idx, mutation_from_string, mutation_to_string, r)
    return r
This should produce the same output, but with noticeably better performance.
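If it is still slow after that change, keep in mind that every slice-and-concatenate step copies the whole chromosome string, so the cost per mutation grows with chromosome length. One possible way around that is to group the mutations by chromosome and rebuild each chromosome once with a single join. The following is only a rough sketch, not a drop-in replacement: the names chrom_index and apply_mutations_batched are made up for illustration, and it assumes that all positions refer to the original (unmutated) sequence and that mutations on the same chromosome do not overlap. If your rows are meant to be applied in post-mutation coordinates and some of them change the length, the result will differ from the loop above.

```python
from collections import defaultdict


def chrom_index(chromosome):
    """Map '1'..'22', 'X', 'Y' to list positions 0..23 (same mapping as above)."""
    if chromosome == 'X':
        return 22
    if chromosome == 'Y':
        return 23
    return int(chromosome) - 1


def apply_mutations_batched(chromosomes, mutations):
    """Apply all mutations with one rebuild per chromosome (hypothetical helper).

    mutations: iterable of (chromosome, start, end, to_allele), where start and
    end are 1-indexed, inclusive positions in the ORIGINAL sequence.
    ASSUMPTION: mutations on the same chromosome do not overlap.
    """
    # Group the edits by chromosome, converting to 0-based slice bounds.
    by_chrom = defaultdict(list)
    for chromosome, start, end, to_allele in mutations:
        by_chrom[chrom_index(str(chromosome))].append((start - 1, end, to_allele))

    result = list(chromosomes)  # works for a tuple or a list; input is not modified
    for idx, edits in by_chrom.items():
        seq = result[idx]
        pieces = []
        cursor = 0  # end of the last region already copied or replaced
        for start, end, to_allele in sorted(edits):
            if start < cursor:
                raise ValueError("overlapping mutations are not supported in this sketch")
            pieces.append(seq[cursor:start])  # untouched stretch before the edit
            if to_allele != '-':              # '-' means deletion, as in the code above
                pieces.append(to_allele)
            cursor = end
        pieces.append(seq[cursor:])           # remainder after the last edit
        result[idx] = ''.join(pieces)         # a single rebuild for this chromosome
    return result
```

With a DataFrame like the one above, the mutations argument could be built from the chromosome, chromosome_start, chromosome_end and mutated_to_allele columns, for example by zipping those four columns together. Whether this actually helps depends on your data, so it is worth timing both versions on a sample.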