How to transpose the unique items in one column and print the values from another column



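I have a file, input.txt, with two columns. I want to split the second column on ";", transpose the unique items, and then count and list the matching entries from column 1 for each item.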

This is my tab-delimited input.txt file:

Gene     Biological_Process
BALF2   metabolic process
CHD4    cell organization and biogenesis;metabolic process;regulation of biological process
TCOF1   cell organization and biogenesis;regulation of biological process;transport
TOP1    cell death;cell division;cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus
BcLF1   0
BALF5   metabolic process
MTA2    cell organization and biogenesis;metabolic process;regulation of biological process
MSH6    cell organization and biogenesis;metabolic process;regulation of biological process;response to stimulus

My expected output:

Biological_Process  Gene
metabolic process   BALF2   CHD4    TOP1    BALF5   MTA2    MSH6
cell organization and biogenesis    CHD4    TCOF1   TOP1    MTA2    MSH6
regulation of biological process    CHD4    TCOF1   TOP1    MTA2    MSH6
transport   TCOF1
cell death  TOP1
cell division   TOP1
response to stimulus    TOP1    MSH6
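$ cat script.awk
#!/usr/bin/awk -f
BEGIN {
    FS = "[\t;]"    # the separator can be a regex: split on tab or semicolon
    OFS = "\t"
}
NR > 1 && /^[A-Z]/ {    # skip the header and blank lines
    for (i = NF; i > 1; i--)
        if ($i)    # skip an empty or "0" bio-proc
            a[$i] = a[$i] OFS $1    # append this gene to the process's list
}
END {
    print "Biological_Process", "Gene(s)"
    for (x in a)    # note: iteration order of "for (x in a)" is unspecified
        print x a[x]
}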
$ ./script.awk input.dat 
Biological_Process  Gene(s)
cell death  TOP1
regulation of biological process    CHD4    TCOF1   TOP1    MTA2    MSH6
transport   TCOF1
cell division   TOP1
metabolic process   BALF2   CHD4    TOP1    BALF5   MTA2    MSH6
response to stimulus    TOP1    MSH6
cell organization and biogenesis    CHD4    TCOF1   TOP1    MTA2    MSH6

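You first need to parse all of the data. For example, start with an empty dictionary, then read each line of the file (skipping line 0 if it is a header): open your file ... iterate over each line. For the columns > 0, use string methods such as split and strip, plus an assignment like dict[gene...] = process..., to create a dictionary key for each string, whose values are the strings from column 0. Then print or write out each of the dictionary's .items():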

input.txt

gene process
A cell org bio
B cell bio
C 0
D org

script.py

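#!/usr/bin/env python
def main():
    pros = {}  # maps each process to the list of genes that have it
    with open("input.txt", "r") as ifile:
        for line in ifile:
            cols = line.strip().split()  # split on any whitespace
            if len(cols) >= 1:
                # column 0 is the gene; every later column is a process
                for pro in cols[1:]:
                    if pro not in pros:
                        pros[pro] = []
                    pros[pro] += [cols[0]]
    with open("output.txt", "w") as ofile:
        # the header line is parsed like data, so it becomes the output header
        for key, val in pros.items():
            ofile.write(f"{key}\t" + "\t".join(val) + "\n")

if __name__ == "__main__":
    main()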

Run:
$ chmod +x ./script.py
$ ./script.py
$ cat ./output.txt

output.txt

process gene
cell    A       B
org     A       D
bio     A       B
0       C
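
The script above splits on any whitespace, which fits the toy input but not the question's file, where process names contain spaces and are separated by ";". The following is a minimal sketch of the same dictionary approach adapted to that layout. It assumes the columns are separated by a single tab, that a value of "0" means "no process" (as the awk answer treats it), and that a count column is wanted. The function name transpose_processes, the skip_values parameter, and the shortened SAMPLE data are illustrative and not part of the original answers.

```python
import io


def transpose_processes(lines, skip_values=("", "0")):
    """Map each unique process to the genes that list it, in file order.

    `lines` is any iterable of tab-separated text lines whose first
    line is a header. `skip_values` is a hypothetical parameter naming
    the placeholders to ignore.
    """
    genes_by_process = {}
    rows = iter(lines)
    next(rows, None)  # skip the header row
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or malformed line
        gene = fields[0].strip()
        for process in fields[1].split(";"):
            process = process.strip()
            if process in skip_values:
                continue
            genes = genes_by_process.setdefault(process, [])
            if gene not in genes:  # avoid listing a gene twice
                genes.append(gene)
    return genes_by_process


# Shortened sample in the question's layout (assumed to be tab-delimited).
SAMPLE = (
    "Gene\tBiological_Process\n"
    "BALF2\tmetabolic process\n"
    "CHD4\tcell organization and biogenesis;metabolic process\n"
    "BcLF1\t0\n"
    "MSH6\tmetabolic process;response to stimulus\n"
)

if __name__ == "__main__":
    result = transpose_processes(io.StringIO(SAMPLE))
    print("Biological_Process", "Count", "Gene", sep="\t")
    for process, genes in result.items():
        print(process, len(genes), *genes, sep="\t")
```

To run it on the real file, pass an open file object instead of the sample, for example transpose_processes(open("input.txt")). Because a plain dict preserves insertion order, processes come out in the order they first appear in the file, unlike the awk version, where the order of "for (x in a)" is unspecified.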
