Python str.replace issue: a line break is inserted after every string replacement



I need to edit the gene names in a GFF file, as shown below. Original file:

chr1    aug gene    10708   108196  .   -   .   ID=gene:g754;biotype=protein_coding
chr1    aug exon    10708   107528  .   -   .   Parent=transcript:g754;Name=g754_T001.exon.1;exon_id=g754_T001.exon.1
chr1    aug gene    20588   20898   .   -   .   ID=gene:g756;biotype=protein_coding
chr1    aug mRNA    20588   20898   .   -   .   ID=transcript:g756;Parent=gene:g756;biotype=protein_coding;transcript_id=g756_T001
chr1    aug exon    20588   20690   .   -   .   Parent=transcript:g756_T001;Name=g756_T001.exon.1;exon_id=g756_T001.exon.1

New file:

chr1    aug gene    10708   108196  .   -   .   ID=gene:Gene00001;biotype=protein_coding
chr1    aug exon    10708   107528  .   -   .   Parent=transcript:Gene00001;Name=Gene00001_T001.exon.1;exon_id=Gene00001_T001.exon.1
chr1    aug gene    20588   20898   .   -   .   ID=gene:Gene00002;biotype=protein_coding
chr1    aug mRNA    20588   20898   .   -   .   ID=transcript:Gene00002;Parent=gene:Gene00002;biotype=protein_coding;transcript_id=Gene00002_T001
chr1    aug exon    20588   20690   .   -   .   Parent=transcript:Gene00002_T001;Name=Gene00002_T001.exon.1;exon_id=Gene00002_T001.exon.1

As input, I have the GFF file and a key list that maps each current gene name to its new name.

g754 Gene00001
g756 Gene00002

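I wrote a Python script to replace the old gene names with the new ones. The replacement itself works as expected, but a line break is inserted after every replaced string. I can't figure out why this happens, and Google has failed me. I did try to mimic the solution here: renaming name ID in gffile, but I have a separate key file for the gene names. I am using anaconda/python3.6.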


Current code:

import sys
import getopt
import operator
in_gff = open("current_gff_file.gff3", "r")
out_gff = open("new_file.gff", "w")
name_key = open("name_key_file.txt", "r")
current_name = []
new_name = []
#create 2 lists of current and new names
for name_row in name_key:
    name_field = name_row.split("\t")
    current_name.append(name_field[0])
    new_name.append(name_field[1])
for row in in_gff:
    line = row.rstrip()
    if line.startswith("#"):
        print(line, file = out_gff, end = "\n") #if it is a header line just print to new file
    else: #loop through list of current gene names
        for name in range(len(current_name)):
            if current_name[name] in line:
                new_line = line.replace(current_name[name], new_name[name])
                print(new_line) #test loop by printing to screen, line breaks happen after every string replacement
                #Output I want: ID=transcript:Gene00002;Parent=gene:Gene00002;biotype=protein_coding;transcript_id=Gene00002_T001
                #Output I get: ID=transcript:Gene00002
                #Parent=gene:Gene00002
                #biotype=protein_coding;transcript_id=Gene00002
                #_T001
            else:
                continue

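When you iterate over a file, each line still includes its trailing newline. Your new names therefore end in "\n", and every replacement copies that newline into the output. Strip it when building the translation table: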

for name_row in name_key:
    name_field = name_row.split("\t")
    current_name.append(name_field[0])
    new_name.append(name_field[1].strip('\n'))  # store stripped names only
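The effect is easy to reproduce in isolation. The sketch below uses a made-up key row and attribute string (illustrative values, not read from your files) to show where the stray newline comes from and how stripping the row removes it.

```python
# A row as it comes out of the key file: the trailing "\n" is still attached.
name_row = "g756\tGene00002\n"

old, new = name_row.split("\t")
# new is "Gene00002\n", so every replacement drags a newline along with it.

attributes = "ID=transcript:g756;Parent=gene:g756;biotype=protein_coding"
broken = attributes.replace(old, new)

# Stripping the row before splitting removes the newline once, at the source.
old, new_clean = name_row.rstrip("\n").split("\t")
fixed = attributes.replace(old, new_clean)

print(repr(broken))  # the repr makes the embedded "\n" visible
print(repr(fixed))
```

Printing with repr() is a handy way to spot hidden whitespace like this when the output looks wrong on screen.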

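I think this is easier to solve with a regular expression. I put your data in a file called original.txt and the new names in new_gene_names.txt, and the result is written to output.txt.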

import re
# Note there is a lookahead and a lookbehind.
# This expression matches any string that starts with
# `gene:` and ends with `;` and the pattern pulls out
# what is between those two things.
pattern = re.compile(r'(?<=gene:)(.*?)(?=;)')
with open('new_gene_names.txt') as gene_names, \
     open('original.txt') as f, \
     open('output.txt', 'w') as out_f:
    # A dictionary mapping gene names to what they will be changed to
    rename_dict = dict(line.split() for line in gene_names)
    for line in f:
        # Search for the gene name
        result = pattern.search(line)
        # If we found a gene name and we can rename it then we substitute it
        if result and result.group(0) in rename_dict:
            line = pattern.sub(rename_dict[result.group(0)], line)
        out_f.write(line)
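To see what the lookbehind/lookahead pattern actually captures, here is a small check on two attribute strings taken from the sample data in the question. The variable names are only for illustration.

```python
import re

pattern = re.compile(r'(?<=gene:)(.*?)(?=;)')

gene_attrs = "ID=gene:g754;biotype=protein_coding"
exon_attrs = "Parent=transcript:g754;Name=g754_T001.exon.1;exon_id=g754_T001.exon.1"

# The gene line contains "gene:...;", so the name between them is found.
gene_match = pattern.search(gene_attrs)

# The exon line has no "gene:" prefix, so this pattern finds nothing there.
exon_match = pattern.search(exon_attrs)

print(gene_match.group(0))
print(exon_match)
```

So with this pattern alone, lines that only mention the name after `transcript:` or `Name=` are written out unchanged.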

Update: to also match the part after Name=, just change the regular expression to:

# This expression matches any string that starts with
# `gene:` and ends with `;`  OR starts with `Name=` and ends with `_`.
# and the pattern pulls out what is between the start and the end.
pattern = re.compile(r'(?<=gene:)(.*?)(?=;)|(?<=Name=)(.*?)(?=_)')

See the regular expression in action.
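As far as I can tell, the patterns here only cover the text after `gene:` and `Name=`, so values after `transcript:` and `exon_id=` would keep the old names. If you want every occurrence renamed, one possible alternative is to build a single pattern from the keys and look each match up in the dictionary. This is only a sketch: the `rename` mapping is hard-coded here for illustration (in practice it would come from the key file), and `rename_ids` is a hypothetical helper name.

```python
import re

# Hypothetical mapping; in practice it would be read from the key file.
rename = {"g754": "Gene00001", "g756": "Gene00002"}

# Longest names first so the alternation prefers the most specific key.
names = sorted(rename, key=len, reverse=True)

# The lookbehind rejects a match preceded by a letter or digit, and the
# lookahead rejects one followed by a digit, so "g754" is not found
# inside "g7540" or "xg754".
id_pattern = re.compile(
    r"(?<![A-Za-z0-9])(?:" + "|".join(map(re.escape, names)) + r")(?![0-9])"
)

def rename_ids(line):
    """Replace every known gene ID in one line with its new name."""
    return id_pattern.sub(lambda m: rename[m.group(0)], line)

line = "Parent=transcript:g756_T001;Name=g756_T001.exon.1;exon_id=g756_T001.exon.1"
print(rename_ids(line))
```

Passing a function to `sub` means each match is replaced by its own new name, so a line that mentions several genes is still handled in one pass.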
