Creating a fingerprint file from a flat file in Python



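I have another newbie Python question. I have a file like the one shown below, and I need to convert it into something like a vector and a fingerprint. My problem is how to combine the file so that, in the final matrix, the rows are the cmp entries and the columns are the val entries; if a cmp is missing a val, that cell should be zero. The vals differ between cmps, and the overlap is not very large. Can you suggest a good way to approach this? A Python dictionary? Any ideas would help. Thanks!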

cmp1    0.277   val_1
cmp1    0.097   val_2
cmp1    0.795   val_3
cmp1    0.809   val_4
cmp1    0.127   val_5
cmp2    0.839   val_3
cmp2    0.909   val_4
cmp2    0.148   val_5
cmp2    0.938   val_6
cmp2    0.599   val_7

The result I need

Vector version

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0.277   0.097   0.795   0.809   0.127   0   0
cmp2    0   0   0.839   0.909   0.148   0.938   0.599   

Binary version

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0   0   1   1   0   0   0
cmp2    0   0   1   1   0   1   1

Current code

import csv
fi = open("data.txt", "rb")
fo = open("data_out.txt", "wb")
reader = csv.reader(fi, delimiter='\t')
writer = csv.writer(fo, delimiter='\t')
# making unique lists
targets = set()
ligands = set()
for row in reader:
    ligands.add(row[0])
    targets.add(row[2])
data = []
for row in reader:
    if row[0] in ligands and row[2] in targets:
    else:

You can use collections.defaultdict here:

# Python 2 syntax (print statement, dict.itervalues)
from collections import defaultdict
with open('abc') as f:
    dic = defaultdict(dict)
    for line in f:
        cmp, val, col = line.split()
        dic[cmp][col] = val
print dic
# defaultdict(<type 'dict'>,
# {'cmp1': {'val_5': '0.127', 'val_4': '0.809', 'val_1': '0.277', 'val_3': '0.795', 'val_2': '0.097'},
#  'cmp2': {'val_5': '0.148', 'val_4': '0.909', 'val_7': '0.599', 'val_6': '0.938', 'val_3': '0.839'}})
# get a sorted list of all val_i from the dic
vals = sorted(set(y for x in dic.itervalues() for y in x))
keys = sorted(dic)
print "name    {}".format("\t".join(vals))
for key in keys:
    # missing columns fall back to '0'
    print "{}    {}".format(key, "\t".join(dic[key].get(v, '0') for v in vals))

Output:

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0.277   0.097   0.795   0.809   0.127   0   0
cmp2    0   0   0.839   0.909   0.148   0.938   0.599
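
For readers on Python 3, the same idea can be sketched as a small function. This is only an illustration of the defaultdict approach above, not the answer's original code: the function name `pivot` and the shortened `sample` input are made up for the example, and it assumes every non-blank line has exactly three whitespace-separated fields.

```python
from collections import defaultdict


def pivot(lines):
    """Turn 'cmp value col' lines into (header, rows), filling gaps with '0'."""
    table = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        cmp_name, value, col = line.split()
        table[cmp_name][col] = value
    # every column name seen in any row, in sorted (lexicographic) order
    cols = sorted({c for row in table.values() for c in row})
    header = ["name"] + cols
    rows = [[name] + [table[name].get(c, "0") for c in cols]
            for name in sorted(table)]
    return header, rows


# hypothetical, shortened sample input
sample = """\
cmp1 0.277 val_1
cmp1 0.097 val_2
cmp2 0.839 val_3
""".splitlines()

header, rows = pivot(sample)
for r in [header] + rows:
    print("\t".join(r))
```

Note that the columns are sorted as plain strings, so a name like val_10 would come before val_2; if that matters for your data, you would need a custom sort key.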

For the binary version, you can try:

print "name    {}".format("\t".join(vals))
for key in keys:
    # round each present value to 0 or 1; missing columns become '0'
    strs = "\t".join(str(int(round(float(dic[key][v])))) if v in dic[key] else '0' for v in vals)
    print "{}    {}".format(key, strs)

Output:

name    val_1   val_2   val_3   val_4   val_5   val_6   val_7
cmp1    0   0   1   1   0   0   0
cmp2    0   0   1   1   0   1   1
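
The rounding step above amounts to a cutoff at 0.5. A possible Python 3 variant makes that cutoff explicit. The helper name `binarize` and its `threshold` parameter are assumptions for this sketch, and the `table` literal simply repeats the question's data in the nested-dict shape the answer builds.

```python
def binarize(table, cols, threshold=0.5):
    """Return {name: [0 or 1, ...]}: 1 if the value is present and >= threshold."""
    return {
        name: [1 if c in vals and float(vals[c]) >= threshold else 0
               for c in cols]
        for name, vals in table.items()
    }


# the question's data as nested dicts (name -> column -> value string)
table = {
    "cmp1": {"val_1": "0.277", "val_2": "0.097", "val_3": "0.795",
             "val_4": "0.809", "val_5": "0.127"},
    "cmp2": {"val_3": "0.839", "val_4": "0.909", "val_5": "0.148",
             "val_6": "0.938", "val_7": "0.599"},
}
cols = sorted({c for row in table.values() for c in row})

bits = binarize(table, cols)
print("name\t" + "\t".join(cols))
for name in sorted(bits):
    print(name + "\t" + "\t".join(str(b) for b in bits[name]))
```

An explicit comparison also sidesteps the fact that round() treats a value of exactly 0.5 differently in Python 2 and Python 3. If you would rather have a plain presence/absence fingerprint, calling binarize(table, cols, threshold=0.0) should do that, assuming the values are never negative.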
