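I have a text file with roughly 20 million rows and 100 columns. I want to output summary statistics giving the number of times each string occurs in each column. Here is part of the file, including the header line, which holds the sample names.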
LP605 LP606 LP607 LP608 LP609
0/0 0/0 0/0 0/0 0/0
0/0 1/1 1/1 0/0 0/0
0/0 0/0 0/0 0/0 0/0
0/0 0/0 0/0 0/0 0/0
0/1 0/1 0/1 0/1 0/0
1/1 0/0 0/1 0/0 0/0
1/1 1/1 ./. 0/0 ./.
0/0 0/0 ./. 0/0 ./.
0/1 0/1 0/0 0/0 0/1
Desired summary output:
Summary LP605 LP606 LP607 LP608 LP609
0/0 4 4 8 8 6
0/1 2 2 1 1 1
1/1 2 1 1 0 0
./. 0 2 2 0 2
Thanks.
This does the job, using awk:
awk '
NR==31 {                        # File header is on line 31
    print "GT", $0              # Print "GT" followed by the header
    n=NF                        # Record the number of columns
    next                        # Skip to the next line
}
{                               # For every other line
    for(i=1; i<=NF; i++) {      # and for every column i
        A[i,$i]++               # Count the occurrences of the value $i in column i
        V[$i]                   # Record every distinct value of $i
    }
}
END {
    for(j in V) {               # For each distinct value
        $1=j                    # Assign the value to field 1
        for(i=1; i<=n; i++)     # For all columns
            $(i+1)=A[i,j]+0     # Put the count of value j in column i into field i+1 (+0 turns a missing entry into 0)
        print                   # Print the line
    }
}
' OFS='\t' file                 # Use a tab as the output field separator
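As a quick check, here is a minimal sketch that runs the same counting logic on the nine sample rows from the question. It rests on a few assumptions that are not in the answer: the header is on the first line (so `NR==1` rather than `NR==31`), the label is `Summary` as in the desired output, the data goes into a temporary file created with `mktemp`, and the value rows are sorted afterwards because `for (j in V)` visits keys in an unspecified order.

```shell
# Write the sample rows from the question to a scratch file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
LP605 LP606 LP607 LP608 LP609
0/0 0/0 0/0 0/0 0/0
0/0 1/1 1/1 0/0 0/0
0/0 0/0 0/0 0/0 0/0
0/0 0/0 0/0 0/0 0/0
0/1 0/1 0/1 0/1 0/0
1/1 0/0 0/1 0/0 0/0
1/1 1/1 ./. 0/0 ./.
0/0 0/0 ./. 0/0 ./.
0/1 0/1 0/0 0/0 0/1
EOF

# Same logic as the answer, with the header assumed to be on line 1.
# The header line is passed through first; only the value rows are sorted.
summary=$(awk '
NR==1 { print "Summary", $0; n=NF; next }   # header: print it, remember the column count
{ for (i=1; i<=NF; i++) { A[i,$i]++; V[$i] } }  # count each value per column
END {
    for (j in V) {
        $1 = j
        for (i=1; i<=n; i++) $(i+1) = A[i,j]+0
        print
    }
}' "$sample" | { IFS= read -r header; printf '%s\n' "$header"; LC_ALL=C sort; })

rm -f "$sample"
printf '%s\n' "$summary"
```

The counts printed here cover only the nine rows shown, so they need not match the desired-output table in the question, which the asker describes as coming from a larger file. On the real file, keep `OFS='\t'` if tab-separated output is wanted.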