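I have a two-column file, input.txt. I want to split the second column on ";", transpose the unique items, and then count and list the entries in column 1 that match each item.
This is my tab-separated input.txt file: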
Gene Biological_Process
BALF2 metabolic process
CHD4 cell organization and biogenesis;metabolic process;regulation of biological process
TCOF1 cell organization and biogenesis;regulation of biological process;transport
TOP1 cell death;cell division;cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
BcLF1 0
BALF5 metabolic process
MTA2 cell organization and biogenesis;metabolic process;regulation of biological process
MSH6 cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
My expected output:
Biological_Process Gene
metabolic process BALF2 CHD4 TOP1 BALF5 MTA2 MSH6
cell organization and biogenesis CHD4 TCOF1 TOP1 MTA2 MSH6
regulation of biological process CHD4 TCOF1 TOP1 MTA2 MSH6
transport TCOF1
cell death TOP1
cell division TOP1
response to stimulus TOP1 MSH6
$ cat script.awk
#! /usr/bin/awk -f
BEGIN {
    FS = "[\t;]"    # the separator can be a regex: tab or semicolon
    OFS = "\t"
}
NR > 1 && /^[A-Z]/ {    # skip header & blank lines
    for (i = NF; i > 1; i--)
        if ($i)    # skip empty bio-proc (an empty field or "0")
            a[$i] = a[$i] OFS $1
}
END {
    print "Biological_Process", "Gene(s)"
    for (x in a)
        print x a[x]
}
$ ./script.awk input.dat
Biological_Process Gene(s)
cell death TOP1
regulation of biological process CHD4 TCOF1 TOP1 MTA2 MSH6
transport TCOF1
cell division TOP1
metabolic process BALF2 CHD4 TOP1 BALF5 MTA2 MSH6
response to stimulus TOP1 MSH6
cell organization and biogenesis CHD4 TCOF1 TOP1 MTA2 MSH6
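For comparison, the same idea can be sketched in Python. awk's for (x in a) visits keys in an unspecified order, which is why the rows above are not in the order of the expected output; a Python dict keeps insertion order, so processes come out in first-seen order. This is only a sketch: the function name transpose and the inline sample (the first few rows of input.txt) are assumptions made for illustration.

```python
import re


def transpose(lines):
    """Map each biological process to the genes annotated with it."""
    groups = {}
    for line in lines[1:]:  # skip the header row
        if not line.strip():
            continue  # skip blank lines
        # Split on tab or semicolon, like FS = "[\t;]" in the awk script.
        fields = re.split(r"[\t;]", line.rstrip("\n"))
        gene = fields[0]
        for proc in fields[1:]:
            if proc and proc != "0":  # "0" is assumed to mean "no process"
                groups.setdefault(proc, []).append(gene)
    return groups


# Hypothetical inline sample: the first rows of the question's input.txt.
sample = [
    "Gene\tBiological_Process",
    "BALF2\tmetabolic process",
    "CHD4\tcell organization and biogenesis;metabolic process;regulation of biological process",
    "TCOF1\tcell organization and biogenesis;regulation of biological process;transport",
    "BcLF1\t0",
]

groups = transpose(sample)
for proc, genes in groups.items():
    print(proc, *genes, sep="\t")
```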
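You need to parse all of the data first. For example, start with an empty dictionary, then read the file line by line (skipping line 0 if it is a header): open your file ... iterate over each line. For every column after the first, use string methods such as split and strip, plus an assignment like dict[gene...] = process..., to create a dictionary key for that string whose value is the string from column 0. Then print or write out each entry from the dictionary's .items():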
input.txt
gene process
A cell org bio
B cell bio
C 0
D org
script.py
#!/usr/bin/env python3
def main():
    pros = {}  # maps each process to the list of genes that have it
    with open("input.txt", "r") as ifile:
        for line in ifile:
            cols = line.strip().split()
            if len(cols) >= 1:
                for pro in cols[1:]:
                    if pro not in pros:
                        pros[pro] = []
                    pros[pro] += [cols[0]]
    with open("output.txt", "w") as ofile:
        for key, val in pros.items():
            ofile.write(f'{key}\t' + '\t'.join(val) + '\n')

if __name__ == "__main__":
    main()
Run:
$ chmod +x ./script.py
$ ./script.py
$ cat ./output.txt
output.txt
process gene
cell A B
org A D
bio A B
0 C
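The question also asks for a count per process, which neither script above prints. Here is a minimal sketch of one way to add it. It assumes the real input is tab-separated with ";" inside the second column, and that "0" means the gene has no annotation. The names process_table and sample are hypothetical, and the sample is read from an in-memory string rather than input.txt so that the snippet runs on its own.

```python
import csv
import io


def process_table(handle):
    """Return {process: [genes]} from a tab-separated Gene/Process table."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader, None)  # skip the header row
    table = {}
    for row in reader:
        if len(row) < 2:
            continue  # blank or malformed line
        gene, annotation = row[0], row[1]
        for proc in annotation.split(";"):
            proc = proc.strip()
            if not proc or proc == "0":  # assumed "no annotation" marker
                continue
            genes = table.setdefault(proc, [])
            if gene not in genes:  # list each gene once per process
                genes.append(gene)
    return table


# Hypothetical sample in the same layout as the question's input.txt.
sample = (
    "Gene\tBiological_Process\n"
    "BALF2\tmetabolic process\n"
    "CHD4\tcell organization and biogenesis;metabolic process\n"
    "BcLF1\t0\n"
)

table = process_table(io.StringIO(sample))
print("Biological_Process", "Count", "Gene(s)", sep="\t")
for proc, genes in table.items():
    print(proc, len(genes), *genes, sep="\t")
```

To read a real file instead, pass an open file object, for example open("input.txt", newline=""), in place of the io.StringIO wrapper.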