如何从csv中提取基于列中多个描述符的行计数，然后使用bash/python脚本导出新的csv

我正在处理一个csv文件(100行(，其中包含以下数据。我想以csv/tab格式获得每个元素的每个基因的计数。

输入

Gene     Element   
---------- ---------- 
STBZIP1    G-box     
STBZIP1    G-box     
STBZIP1    MYC       
STBZIP1    MYC       
STBZIP1    MYC       
STBZIP10   MYC       
STBZIP10   MYC       
STBZIP10   MYC       
STBZIP10   G-box     
STBZIP10   G-box     
STBZIP10   G-box     
STBZIP10   G-box

预期输出

Gene     G-Box   MYC  
---------- ------- ----- 
STBZIP1        2     3  
STBZIP10       4     3

有人能帮我想出一个关于这方面的bash脚本(或python(吗？

更新

我正在尝试以下操作，但暂时卡住了：|；

import pandas as pd
df = pd.read_csv("Promoter_Element_Distribution.csv")
print (df)
df.groupby(['Gene', 'Element']).size().unstack(fill_value=0)

由于您也要求bash版本，因此这里使用了awk¹。它被评论，并且输出是"；"井"；格式化，所以代码有点大(大约20行没有注释(。

awk '# First record line:
# Storing all column names into elements, including
# the first column name
NR == 1 {firstcol=$1;element[$1]++}
# Each line starting with the second one are datas
# Occurrences are counted with an indexed array
# count[x][y] contains the count of Element y for the Gene x
NR > 2 {element[$2]++;count[$1][$2]++} 
# Done, time for displaying the results
END {
# Let us display the first line, column names
## Left-justify the first col, because it is text
printf "%-10s ", firstcol
## Other are counts, so we right-justify
for (i in element) if (i != firstcol) printf "%10s ", i
printf "n"

# Now an horizontal bar
for (i in element) {
c = 0
while (c++ < 10) { printf "-"}
printf " ";
} 
printf "n"
# Now, loop through the count records
for (i in count) {
# Left justification for the column name
printf "%-10s ", i ;
for(j in element)
# For each counted element (ie except the first one),
# print it right-justified
if (j in count[i]) printf "%10s", count[i][j]
printf "n"
}
}' tab-separated-input.txt

结果：

Gene            G-box        MYC 
---------- ---------- ---------- 
STBZIP10            4         3
STBZIP1             2         3

¹此解决方案需要GNUawk用于数组(count[$1][$2]语法(-感谢Ed Morton

使用以下格式的文件(此处命名为input.csv(：

Gene     Element   
---------- ---------- 
STBZIP1    G-box     
STBZIP1    G-box     
STBZIP1    MYC       
STBZIP1    MYC       
STBZIP1    MYC       
STBZIP10   MYC       
STBZIP10   MYC       
STBZIP10   MYC       
STBZIP10   G-box     
STBZIP10   G-box     
STBZIP10   G-box     
STBZIP10   G-box

这个

import pandas as pd
df = pd.read_csv('input.csv', delim_whitespace=True, skiprows=1)
df.columns = ['Gene', 'Element']
df['Count'] = 1
df = df.pivot_table(index='Gene', columns='Element', aggfunc=sum)
print(df)

给你

Count    
Element  G-box MYC
Gene              
STBZIP1      2   3
STBZIP10     4   3

相关内容

最新更新

热门标签：