How to modify a column of a TSV file with Python



I have a GFF3 file (essentially a TSV file with 9 columns), and I am trying to change the first column and write the changes back to the file itself.

The GFF3 file looks like this:

## GFF3 file
## replicon1
## replicon2
replicon_1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon_1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon_2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon_2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

I wrote a few lines of code in which I choose the symbol to change (e.g. "_") and the symbol to replace it with (e.g. "@"):

import os
import re
import argparse
import pandas as pd

def myfunc() -> argparse.Namespace:
    # Parse the file path and the two symbols from the command line
    ap = argparse.ArgumentParser()
    ap.add_argument("-f", "--file", help="path to file")
    ap.add_argument("-i", "--input_word", help="Symbol to delete")
    ap.add_argument("-o", "--output_word", help="Symbol to insert")
    return ap.parse_args()

args = myfunc()
my_file = args.file
in_char = args.input_word
out_char = args.output_word
with open(my_file, 'r+') as f:
    rawfl = f.read()
    rawfl = re.sub(in_char, out_char, rawfl)
    f.seek(0)
    f.write(rawfl)

The output is as follows:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some@gene@1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some@gene@1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some@gene@2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some@gene@2;

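As you can see, every "_" has been changed to "@". I tried to modify the script with pandas so that the change is applied only to the first column (seqid, below):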

with open(my_file, 'r+') as f:
    genomic_dataframe = pd.read_csv(f, sep="\t", names=['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes'])
    genid = genomic_dataframe.seqid
    genid = str(genid)  # this is used because re.sub expects strings, not dataframe
    genid = re.sub(in_char, out_char, genid)
    f.seek(0)
    f.write(genid)

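I did not get the expected result: something resembling the seqid column (correctly modified) was added to the file, but it did not overwrite the original content.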

What I would like to get is something like this:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;

where the "@" symbol appears only in the first column, while "_" is preserved in the 9th column.

Do you know how to solve this? Thanks, everyone.
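One likely contributor to the behavior described in the question: writing after `f.seek(0)` on a file opened with `'r+'` overwrites characters from the start but does not shorten the file, so whatever lies beyond the newly written text stays in place. In addition, `str()` of a pandas Series returns its printed representation (index and dtype included), not the file's rows. The sketch below reproduces only the first effect; the file name `demo.txt` and its contents are made up for illustration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")  # hypothetical demo file
    with open(path, "w") as f:
        f.write("AAAAAAAAAA\n")  # ten characters plus a newline

    with open(path, "r+") as f:
        f.read()
        f.seek(0)
        f.write("bbb")  # shorter than the original content
        # No f.truncate() here, so the old tail remains in the file.

    with open(path) as f:
        leftover = f.read()

print(repr(leftover))
```

Calling `f.truncate()` right after `f.write(...)` would cut the file off at the current position and remove the stale tail.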

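If you only want to replace the first occurrence of _ with @ on each line, you can do so without loading the file as a dataframe and without any third-party library such as pandas.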

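with open('f') as f:
    lines = [line.rstrip('\n') for line in f]
for i, line in enumerate(lines):
    # Ignore comments and blank lines
    if not line or line[0] == '#':
        continue
    # Replace only the first "_" on the line
    lines[i] = line.replace('_', '@', 1)
# Write the modified lines back to the same file
with open('f', 'w') as f:
    f.write('\n'.join(lines) + '\n')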

The result contains:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;
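Replacing only the first "_" on each line is enough when the seqid contains a single underscore. If a seqid could contain several, a minimal sketch that limits the replacement to the first tab-separated field might look like the following. The helper name `replace_in_first_column` and the sample line are assumptions for illustration, and the sketch assumes the columns are separated by real tab characters, as in a GFF3 file.

```python
def replace_in_first_column(line: str, old: str, new: str) -> str:
    """Replace every `old` with `new` in the first tab-separated field only."""
    # Leave blank lines and comment/directive lines untouched.
    if not line.strip() or line.startswith("#"):
        return line
    # Split once: `first` is the seqid column, `rest` holds the other columns.
    first, sep, rest = line.partition("\t")
    return first.replace(old, new) + sep + rest


# Hypothetical sample line with two underscores in the seqid.
sample = "my_replicon_1\tprokka\tgene\t0\t15\t.\t+\t.\tID=some_gene_1;"
print(replace_in_first_column(sample, "_", "@"))
```

Because `str.partition` keeps the separator and everything after it unchanged, the attributes column is never touched, however many underscores it holds.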

You can use re.sub with a pattern anchored at ^ (the start of the string) and pass a lambda function as the replacement. For example:

import re

# change only first column:
r = re.compile(r"^(.*?)(?=\s)")
in_char = "_"
out_char = "@"
with open("input_file.txt", "r") as f_in, open("output_file.txt", "w") as f_out:
    for line in map(str.strip, f_in):
        # skip empty lines and lines starting with ##
        if not line or line.startswith("##"):
            print(line, file=f_out)
            continue
        # apply the replacement only to the text before the first whitespace
        line = r.sub(lambda g: g.group(1).replace(in_char, out_char), line)
        print(line, file=f_out)

This creates output_file.txt:

## GFF3 file
## replicon1
## replicon2
replicon@1  prokka  gene    0   15  .   @   .   ID=some_gene_1;
replicon@1  prokka  gene    40  61  .   @   .   ID=some_gene_1;
replicon@2  prokka  gene    8   32  .   @   .   ID=some_gene_2;
replicon@2  prokka  gene    70  98  .   @   .   ID=some_gene_2;
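Since the question asks for the file itself to be overwritten, one possible variation is to write the result to a temporary file in the same directory and then swap it in with `os.replace`. This is only a sketch under stated assumptions: the function name `rewrite_first_column` is hypothetical, the columns are assumed to be tab-separated, and lines starting with "#" are treated as comments.

```python
import os
import tempfile


def rewrite_first_column(path: str, old: str, new: str) -> None:
    """Rewrite `path` in place, replacing `old` with `new` in column 1 only."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file next to the original, then swap it in.
    with open(path, "r") as f_in, tempfile.NamedTemporaryFile(
        "w", dir=directory, delete=False
    ) as f_out:
        for line in f_in:
            # Only data lines are modified; comments and blanks pass through.
            if line.strip() and not line.startswith("#"):
                first, sep, rest = line.partition("\t")
                line = first.replace(old, new) + sep + rest
            f_out.write(line)
    # Both files are closed here; replace the original with the new content.
    os.replace(f_out.name, path)
```

A call such as `rewrite_first_column("input_file.txt", "_", "@")` would then update the file without leaving a separate output file behind. Writing to a temporary file first means the original stays intact if something fails partway through.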
