Splitting lines from an infile on " in Python



I have a series of input files that look like this:

chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   gene_id "KDM4A"; transcript_id "KDM4A";
chr1    hg19_refFlat    exon    19563636    19563732    0.000000    -   .   gene_id "EMC1"; transcript_id "EMC1";
chr1    hg19_refFlat    exon    52870219    52870551    0.000000    +   .   gene_id "PRPF38A"; transcript_id "PRPF38A";
chr1    hg19_refFlat    exon    53373540    53373626    0.000000    -   .   gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";
chr1    hg19_refFlat    exon    11839859    11840067    0.000000    +   .   gene_id "C1orf167"; transcript_id "C1orf167";
chr1    hg19_refFlat    exon    29037032    29037154    0.000000    +   .   gene_id "GMEB1"; transcript_id "GMEB1";
chr1    hg19_refFlat    exon    103356007   103356060   0.000000    -   .   gene_id "COL11A1"; transcript_id "COL11A1";

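In my code I am trying to capture two elements from each line: the first is the number right after the word exon, and the second is the gene (the combination of letters and digits enclosed in double quotes, e.g. "KDM4A"). Here is my code: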

    with open(infile,'r') as r:
        start = set([line.strip().split()[3] for line in r])
        genes = set([line.split('"')[1] for line in r])
        print len(start)
        print len(genes)

For some reason, start works just fine, but genes does not capture anything. Here is the output:

 48050
 0

I thought this had to do with the " around the gene names, but if I type this into the terminal it works fine:

>>> x = 'A b P "G" m'
>>> x
'A b P "G" m'
>>> x.split('"')[1]
'G'
>>> 

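Any solution would be greatly appreciated, even if it is a completely different way of getting the two pieces of data from each line. Thanks.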

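This happens because a file object can only be iterated once. After the first loop, start = set([line.strip().split()[3] for line in r]), the file object is exhausted. The second loop, genes = set([line.split('"')[1] for line in r]), then iterates over an exhausted file object, gets no lines, and produces an empty set.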
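
The exhaustion is easy to reproduce in isolation. The following is a minimal sketch, assuming a throwaway temporary file whose two lines are made up for illustration; it is not taken from the asker's data.

```python
import os
import tempfile

# Write a tiny two-line file to iterate over (illustrative data only).
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as tmp:
    tmp.write('a "X" b\nc "Y" d\n')

with open(path) as r:
    first_pass = [line for line in r]    # consumes the file up to EOF
    second_pass = [line for line in r]   # already at EOF, so nothing is left
    r.seek(0)                            # rewind to the beginning
    third_pass = [line for line in r]    # the lines are available again

os.remove(path)
print(len(first_pass))   # 2
print(len(second_pass))  # 0
print(len(third_pass))   # 2
```

The second pass is empty for the same reason genes is empty in the question: nothing resets the read position between the two comprehensions.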

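Solutions:

You can seek back to the beginning of the file (this is one of several possible solutions).

Modified version of your code:

with open(infile,'r') as r:
    start = set([line.strip().split()[3] for line in r])
    r.seek(0, 0)  # rewind to the start of the file before the second pass
    genes = set([line.split('"')[1] for line in r])
    print len(start)
    print len(genes)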

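You can also use a regular expression and collect both values in a single pass:

import re

with open(infile) as f:
    start = []
    genes = []
    for line in f:
        # group 1: the number after "exon"; group 2: the quoted gene_id value
        st, gen = re.search(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"', line).groups()
        start.append(st)
        genes.append(gen)
    print set(start)
    print set(genes)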
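
One caveat with the regex approach: re.search returns None when a line does not match, so calling .groups() directly raises AttributeError on a blank line or a non-exon record. The sketch below skips such lines instead. The names PATTERN, extract and sample are hypothetical, and the sample row is copied from the question.

```python
import re

# Number after "exon" and the quoted value that follows gene_id.
PATTERN = re.compile(r'\bexon\s+(\d+)\b.*?\s+gene_id\s+"([^"]*)"')

def extract(lines):
    """Return (starts, genes) as sets, skipping lines that do not match."""
    starts, genes = set(), set()
    for line in lines:
        m = PATTERN.search(line)
        if m is None:          # blank line, header, or non-exon feature
            continue
        starts.add(m.group(1))
        genes.add(m.group(2))
    return starts, genes

sample = [
    'chr1    hg19_refFlat    exon    53373540    53373626    0.000000    -   .   '
    'gene_id "ECHDC2"; transcript_id "ECHDC2_dup2";\n',
    '\n',  # would break a bare .groups() call
]
starts, genes = extract(sample)
```

With a real file, the same function can be called as extract(f) inside a with open(infile) as f: block, since a file object is an iterable of lines.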

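You can load all the lines into a list and then do the split on each item of that list. The list can be iterated as many times as needed, but the whole file is held in memory, so I am not sure how efficient this is for a very long file.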

with open(infile) as r:
    lines = list(r)  # read every line once; a list can be iterated repeatedly
    start = set([line.strip().split()[3] for line in lines])
    genes = set([line.split('"')[1] for line in lines])
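
If memory is a concern, both sets can also be filled in a single pass without keeping the lines around. The sketch below reuses the two split expressions from the question. The helper name collect and the variable rows are hypothetical, and the code assumes every non-blank line contains at least four fields and one quoted value.

```python
def collect(lines):
    """Build the start and gene sets in one pass over an iterable of lines."""
    start, genes = set(), set()
    for line in lines:
        if not line.strip():
            continue                      # ignore blank lines
        start.add(line.split()[3])        # 4th whitespace-separated field
        genes.add(line.split('"')[1])     # text between the first pair of quotes
    return start, genes

# Two rows copied from the question, used here in place of a real file.
rows = [
    'chr1    hg19_refFlat    exon    44160380    44160565    0.000000    +   .   '
    'gene_id "KDM4A"; transcript_id "KDM4A";\n',
    'chr1    hg19_refFlat    exon    19563636    19563732    0.000000    -   .   '
    'gene_id "EMC1"; transcript_id "EMC1";\n',
]
start, genes = collect(rows)
```

For a file on disk the call would be start, genes = collect(r) inside the with open(infile) as r: block.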

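Use shlex (which splits the way shell arguments are split) to neutralize the runs of spaces and the quotes.
I am not sure whether it is faster, but it is safe and a bit nicer.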

import shlex
with open(infile, 'r') as f:
    for line in f:
        parts = shlex.split(line.replace(';', ''))
        print parts[3], parts[9]
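
To make the indices 3 and 9 less mysterious, here is a small sketch of what shlex.split produces for the first row of the question once the semicolons are removed. The variable names are illustrative only.

```python
import shlex

line = ('chr1    hg19_refFlat    exon    44160380    44160565    0.000000    '
        '+   .   gene_id "KDM4A"; transcript_id "KDM4A";')

# Remove ';' first, then let shlex collapse whitespace and strip the quotes.
parts = shlex.split(line.replace(';', ''))

start = parts[3]   # '44160380', the number after "exon"
gene = parts[9]    # 'KDM4A', the value that follows gene_id
print(len(parts))  # 12
```

Index 8 holds the literal token gene_id and index 9 its unquoted value, which is why parts[9] is the gene name.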

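The reason loading genes fails is that the file would have to be read again from the beginning. The following approach avoids that by filling both sets in a single pass, and should work: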

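import re
start = set()
genes = set()
with open('input.txt', 'r') as f_input:
    for line in f_input:
        # skip the first three whitespace-separated fields, capture the next number,
        # then capture the last quoted word on the line
        s, g = re.match(r'(?:.*?\s+){3}(\d+).*"(\w+)"', line).groups()
        start.add(s)
        genes.add(g)
print start
print genes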

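Output (because .* is greedy, the second group is the last quoted value on each line, i.e. the transcript_id, hence ECHDC2_dup2 rather than ECHDC2):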

set(['44160380', '29037032', '103356007', '19563636', '53373540', '52870219', '11839859'])
set(['COL11A1', 'PRPF38A', 'KDM4A', 'C1orf167', 'EMC1', 'GMEB1', 'ECHDC2_dup2'])