我正在处理一个csv文件(100行(,其中包含以下数据。我想以csv/tab格式获得每个元素的每个基因的计数。
输入
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
预期输出
Gene G-Box MYC
---------- ------- -----
STBZIP1 2 3
STBZIP10 4 3
有人能帮我想出一个关于这方面的bash脚本(或python(吗?
更新
我正在尝试以下操作,但暂时卡住了:|;
import pandas as pd
df = pd.read_csv("Promoter_Element_Distribution.csv")
print (df)
df.groupby(['Gene', 'Element']).size().unstack(fill_value=0)
由于您也要求bash版本,因此这里使用了awk
1。它被评论,并且输出是";"井";格式化,所以代码有点大(大约20行没有注释(。
awk '# First record line:
# Storing all column names into elements, including
# the first column name
NR == 1 {firstcol=$1;element[$1]++}
# Each line starting with the second one are datas
# Occurrences are counted with an indexed array
# count[x][y] contains the count of Element y for the Gene x
NR > 2 {element[$2]++;count[$1][$2]++}
# Done, time for displaying the results
END {
# Let us display the first line, column names
## Left-justify the first col, because it is text
printf "%-10s ", firstcol
## Other are counts, so we right-justify
for (i in element) if (i != firstcol) printf "%10s ", i
printf "n"
# Now an horizontal bar
for (i in element) {
c = 0
while (c++ < 10) { printf "-"}
printf " ";
}
printf "n"
# Now, loop through the count records
for (i in count) {
# Left justification for the column name
printf "%-10s ", i ;
for(j in element)
# For each counted element (ie except the first one),
# print it right-justified
if (j in count[i]) printf "%10s", count[i][j]
printf "n"
}
}' tab-separated-input.txt
结果:
Gene G-box MYC
---------- ---------- ----------
STBZIP10 4 3
STBZIP1 2 3
1此解决方案需要GNUawk
用于数组(count[$1][$2]
语法(-感谢Ed Morton
使用以下格式的文件(此处命名为input.csv
(:
Gene Element
---------- ----------
STBZIP1 G-box
STBZIP1 G-box
STBZIP1 MYC
STBZIP1 MYC
STBZIP1 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 MYC
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
STBZIP10 G-box
这个
import pandas as pd
df = pd.read_csv('input.csv', delim_whitespace=True, skiprows=1)
df.columns = ['Gene', 'Element']
df['Count'] = 1
df = df.pivot_table(index='Gene', columns='Element', aggfunc=sum)
print(df)
给你
Count
Element G-box MYC
Gene
STBZIP1 2 3
STBZIP10 4 3