Count the unique strings in each column and print the results by column



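I have a text file with roughly 20 million rows and 100 columns. I want to output summary statistics giving the number of occurrences of each string in each column. Here is part of the file, including the header row, which holds the sample names.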

LP605  LP606   LP607   LP608   LP609
0/0 0/0 0/0 0/0 0/0
0/0 1/1 1/1 0/0 0/0
0/0 0/0 0/0 0/0 0/0
0/0 0/0 0/0 0/0 0/0
0/1 0/1 0/1 0/1 0/0
1/1 0/0 0/1 0/0 0/0
1/1 1/1 ./. 0/0 ./.
0/0 0/0 ./. 0/0 ./.
0/1 0/1 0/0 0/0 0/1

Desired output for the summary statistics

Summary LP605   LP606   LP607   LP608   LP609
0/0 4   4   8   8   6
0/1 2   2   1   1   1
1/1 2   1   1   0   0
./. 0   2   2   0   2

Thanks

This does the job, using awk:

awk '
  NR==31 {                    # File header is on line 31 (use NR==1 when the header is the first line, as in the sample)
    print "GT", $0            # Print "GT" followed by the header
    n=NF                      # Record the number of columns
    next                      # Skip to the next line
  }
  {                           # For every other line
    for(i=1; i<=NF; i++) {    # and for every column ("i")
      A[i,$i]++               # Use array A to count the occurrences of the value ($i) in column i
      V[$i]                   # Record every distinct value of ($i)
    }
  }
  END {
    for(j in V) {             # For every distinct value
      $1=j                    # Assign the value to field 1
      for(i=1; i<=n; i++)     # For all columns
        $(i+1)=A[i,j]+0       # Assign the count of value "j" in column i to field i+1 (+0 turns a missing count into 0)
      print                   # Print the line
    }
  }
' OFS='\t' file               # Use tab as the output field separator
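
Note that `for (j in V)` visits the values in an unspecified order, so the rows of the summary can come out in any order. If a stable order matters, one possible variant is sketched below. It is only a sketch under some assumptions: the header is on line 1 (as in the sample in the question), and the function name `count_by_column` and the helper arrays `seen` and `order` are made up for this illustration. Values are printed in the order they are first seen, and each output line is built as a string rather than by reassigning fields in `END`.

```shell
#!/bin/sh
# Illustrative variant only: count_by_column, seen and order are hypothetical names.
count_by_column() {
  awk '
    BEGIN { OFS = "\t" }             # tab-separated output
    NR == 1 {                        # assumption: header is on line 1
      $1 = $1                        # rebuild the record so the names are joined by OFS
      print "GT", $0
      n = NF                         # number of columns
      next
    }
    {
      for (i = 1; i <= NF; i++) {
        A[i, $i]++                   # occurrences of value $i in column i
        if (!($i in seen)) {         # remember each distinct value once,
          seen[$i]                   # in first-seen order
          order[++m] = $i
        }
      }
    }
    END {
      for (k = 1; k <= m; k++) {
        line = order[k]
        for (i = 1; i <= n; i++)
          line = line OFS (A[i, order[k]] + 0)   # + 0 turns a missing count into 0
        print line
      }
    }
  ' "$@"
}

# Run it on the nine sample rows from the question.
printf '%s\n' \
  'LP605 LP606 LP607 LP608 LP609' \
  '0/0 0/0 0/0 0/0 0/0' \
  '0/0 1/1 1/1 0/0 0/0' \
  '0/0 0/0 0/0 0/0 0/0' \
  '0/0 0/0 0/0 0/0 0/0' \
  '0/1 0/1 0/1 0/1 0/0' \
  '1/1 0/0 0/1 0/0 0/0' \
  '1/1 1/1 ./. 0/0 ./.' \
  '0/0 0/0 ./. 0/0 ./.' \
  '0/1 0/1 0/0 0/0 0/1' | count_by_column
```

For those nine rows alone this should print `0/0 5 5 4 8 6`, `1/1 2 2 1 0 0`, `0/1 2 2 2 1 1` and `./. 0 0 2 0 2` (tab-separated) under the `GT` header line. The function also accepts a file name, e.g. `count_by_column file`. Memory use stays proportional to the number of distinct (column, value) pairs, not to the number of rows, so the same single pass should still be workable on a 20-million-row file.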
