Using Python to manipulate a txt file of grouped key-value lines



I am trying to use Python to convert a text file from Format A:

Key1  
Key1value1  
Key1value2  
Key1value3  
Key2  
Key2value1  
Key2value2  
Key2value3  
Key3... 

Into Format B:

Key1 Key1value1  
Key1 Key1value2  
Key1 Key1value3  
Key2 Key2value1  
Key2 Key2value2  
Key2 Key2value3  
Key3 Key3value1...

Specifically, here is a short excerpt of the file itself (only one key is shown; the full file contains thousands of keys):

chr22:16287243: PASS  
patientID1  G/G  
patientID2  G/G  
patientID3  G/G

And here is the desired output:

chr22:16287243: PASS  patientID1    G/G  
chr22:16287243: PASS  patientID2    G/G  
chr22:16287243: PASS  patientID3    G/G

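I have written the following code, which detects and displays the keys, but I am having trouble writing the code to store the values associated with each key and then print the key-value pairs. Can anyone help me with this?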

import sys
import re
records=[]
with open('filepath', 'r') as infile:
    for line in infile:
        variant = re.search(r"\Achr\d", line, re.I) # all variants start with "chr"
        if variant:
            records.append(line.replace("\n",""))
            #parse lines until a new variant is encountered
for r in records:
    print (r)

Do it in a single pass, without storing the lines:

with open("input") as infile, open("output", "w") as outfile:
    for line in infile:
        if line.startswith("chr"):
            key = line.strip()  # remember the current key
        else:
            # write the current key followed by the value line
            print(key, line.rstrip("\n"), file=outfile)

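This code assumes that the first line contains a key; otherwise it fails with a NameError, because key has not been assigned yet.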
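
If the file might start with something other than a key, one option is to track whether a key has been seen yet. The following is only a sketch under that assumption: the function name `pair_keys_with_values` and the `key_prefix` parameter are made up for illustration, and it simply skips blank lines and any lines that appear before the first key.

```python
def pair_keys_with_values(lines, key_prefix="chr"):
    """Yield 'key value' strings; lines before the first key are skipped."""
    key = None  # no key seen yet
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(key_prefix):
            key = line  # a new key starts here
        elif key is not None:
            yield f"{key} {line}"


# Sample data modelled on the excerpt in the question, plus one stray line.
sample = [
    "stray line before any key\n",
    "chr22:16287243: PASS\n",
    "patientID1  G/G\n",
    "patientID2  G/G\n",
]
result = list(pair_keys_with_values(sample))
for row in result:
    print(row)
```

Because it accepts any iterable of lines, the same function can be fed an open file object instead of the sample list.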

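First, if you only need to check whether a string starts with a particular sequence of characters, don't use a regular expression. This is simpler and easier to read: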

if line.startswith("chr"):

The next step is to use a very simple state machine. Something like this:

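current_key = ""
with open("input.txt") as infile:  # placeholder file name
    for line in infile:
        if line.startswith("chr"):
            # a new key: remember it for the value lines that follow
            current_key = line.strip()
        else:
            # a value line: print it prefixed with the most recent key
            print(" ".join([current_key, line.strip()]))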
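
The question also asks about storing the values that belong to each key. The same state machine can fill a dictionary instead of printing right away. This is a rough sketch rather than a definitive solution: `group_values` and `key_prefix` are hypothetical names, and it assumes that a key appearing twice in the file should have its values merged into one list.

```python
from collections import defaultdict


def group_values(lines, key_prefix="chr"):
    """Return a dict mapping each key line to the list of value lines under it."""
    groups = defaultdict(list)
    current_key = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(key_prefix):
            current_key = line
            groups[current_key]  # register the key even if no values follow
        elif current_key is not None:
            groups[current_key].append(line)
    return dict(groups)


# Sample data modelled on Format A from the question.
sample = [
    "chr22:16287243: PASS\n",
    "patientID1  G/G\n",
    "patientID2  G/G\n",
    "chr22:16287300: PASS\n",
    "patientID1  A/G\n",
]
grouped = group_values(sample)
for key, values in grouped.items():
    for value in values:
        print(key, value)
```

Holding every group in memory is only worthwhile if the values are needed again later; for a plain reformatting job the streaming version uses far less memory.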

If the number of values per key is always the same, islice is useful:

from itertools import islice
with open('input.txt') as fin, open('output.txt','w') as fout:
    for k in fin:
        for v in islice(fin,3):
            fout.write(' '.join((k.strip(),v)))
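
One caveat with the fixed-count approach is that a single group with a different number of values shifts every later pairing without any error. The sketch below adds an alignment check; it assumes keys can be recognised by a "chr" prefix as in the question, and `regroup_fixed`, `n_values` and `key_prefix` are illustrative names only.

```python
import io
from itertools import islice


def regroup_fixed(fin, n_values, key_prefix="chr"):
    """Pair each key line with the next n_values lines, checking alignment."""
    out = []
    for k in fin:
        key = k.strip()
        if not key.startswith(key_prefix):
            raise ValueError(f"expected a key line, got: {key!r}")
        # islice pulls the next n_values lines from the same iterator
        for v in islice(fin, n_values):
            value = v.strip()
            if value.startswith(key_prefix):
                raise ValueError(f"expected a value line, got: {value!r}")
            out.append(f"{key} {value}")
    return out


# io.StringIO stands in for an open file here.
good = io.StringIO("chr1\na\nb\nchr2\nc\nd\n")
rows = regroup_fixed(good, 2)
for row in rows:
    print(row)
```

A file object (or `io.StringIO`) is its own iterator, which is why the outer loop and `islice` can share it and never read the same line twice.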
