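I am trying to use the `awk` below to get output that looks like the desired output, and I am having some trouble with the syntax. The part I am having trouble with is, for a specific target (`$1`), counting the bases (`$3`) that have fewer than 30 reads, outputting that number, and calculating their average. Thank you.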
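awk '
{
    N[$1]++       # number of bases seen for this target
    T[$1] += $4   # running total of reads for this target
    M[$1] = $2    # name/id of the target
}
END {
    for (X in N)
        printf("%s is %d bases and maps to %s with an average depth of %f reads\n",
               X, N[X], M[X], T[X] / N[X])
}
' input.txt > output.txt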
Input
chr1:955542-955763 AGRN:exon.1 1 0
chr1:955542-955763 AGRN:exon.1 2 0
chr1:955542-955763 AGRN:exon.1 3 0
chr1:955542-955763 AGRN:exon.1 4 1
chr1:955542-955763 AGRN:exon.1 5 1
chr1:955542-955763 AGRN:exon.1 6 1
chr1:955542-955763 AGRN:exon.1 7 1
chr1:955542-955763 AGRN:exon.1 8 1
chr1:955542-955763 AGRN:exon.1 9 1
chr1:955542-955763 AGRN:exon.1 10 1
chr1:955542-955763 AGRN:exon.1 11 32
Current output
chr1:955542-955763 is 11 bases and maps to AGRN:exon.1 with an average depth of 3.545455 reads
Desired output
chr1:955542-955763 is 11 bases and maps to AGRN:exon.1 with an average depth of 3.54 reads and there are 10 bases less than 30 reads with an average coverage of 0.63 reads
Edit (field descriptions)
awk '{for (i=1; i<=NF; i++) print i, $i}' input.txt
1 chr1:955542-955763 (defines the specific target location) - variable N
2 AGRN:exon.1 (defines the name/id of the target location) - variable M
3 1 (defines the exact base on the target)
4 0 (used to calculate the average) - variable T
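The first part of the output seems to work perfectly; I am only trying to add to it to get the second part, that is, `and there are 10 bases less than 30 reads with an average coverage of 0.63 reads`, where `10` is the last base in `$3` with fewer than 30 reads and `0.63` is the average of all the values in `$4` for those bases. I hope this helps, thank you :).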
Two-dimensional output
Lo: chr1:955542-955763 is 10 bases and maps to AGRN:exon.1 with an average depth of 0.700000 reads
Hi: chr1:955542-955763 is 1 bases and maps to AGRN:exon.1 with an average depth of **2.909091** reads ( should be 32 - `$4` is 32 / 1)
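The value 2.909091 on the Hi line equals 32 / 11. The command that produced this output is not shown, so this is only an assumption, but the number is consistent with the Hi total being divided by the count of all 11 bases instead of by the 1 base in the Hi group. A quick arithmetic check:

```shell
# Compare the two candidate divisors for the Hi group total of 32 reads.
result=$(awk 'BEGIN { printf "%f %f\n", 32 / 11, 32 / 1 }')
printf '%s\n' "$result"   # prints: 2.909091 32.000000
```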
Updated answer
For output split in two by the threshold, I would resort to GNU `awk` and its two-dimensional arrays:
gawk '
{
    i = 1                  # second index 1: $4 < 30
    if ($4 >= 30) i = 2    # second index 2: $4 >= 30
    N[$1][i]++             # bases in this group
    T[$1][i] += $4         # total reads in this group
    B[$1][i]++             # divisor for the average (same count as N)
    M[$1][i] = $2          # target name
}
END {
    fmt = "%s: %s is %d bases and maps to %s with an average depth of %f reads\n"
    for (X in N) {
        # Print a group only if it has rows, so an empty group never divides by zero
        if (1 in N[X]) printf(fmt, "Lo", X, N[X][1], M[X][1], T[X][1] / B[X][1])
        if (2 in N[X]) printf(fmt, "Hi", X, N[X][2], M[X][2], T[X][2] / B[X][2])
    }
}' input.txt
Lo: chr1:955542-955763 is 10 bases and maps to AGRN:exon.1 with an average depth of 0.700000 reads
Hi: chr1:955542-955763 is 1 bases and maps to AGRN:exon.1 with an average depth of 32.000000 reads
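If GNU `awk` is not available, the same Lo/Hi split can be sketched in POSIX `awk` using composite subscripts (`N[$1, g]`) instead of arrays of arrays. This is a minimal sketch, not a drop-in replacement: the group labels `g`, the helper array `seen`, and the inline sample data (copied from the question's input) are choices made for this illustration.

```shell
# POSIX awk sketch: emulate the two-level arrays with composite keys.
result=$(awk '
{
    g = ($4 >= 30) ? "Hi" : "Lo"   # group label for this row
    N[$1, g]++                     # bases in this target and group
    T[$1, g] += $4                 # total reads in this target and group
    M[$1] = $2                     # target name
    seen[$1]                       # remember each target once
}
END {
    fmt = "%s: %s is %d bases and maps to %s with an average depth of %f reads\n"
    for (X in seen) {
        # Skip empty groups so the division is always safe
        if ((X, "Lo") in N) printf fmt, "Lo", X, N[X, "Lo"], M[X], T[X, "Lo"] / N[X, "Lo"]
        if ((X, "Hi") in N) printf fmt, "Hi", X, N[X, "Hi"], M[X], T[X, "Hi"] / N[X, "Hi"]
    }
}' <<'EOF'
chr1:955542-955763 AGRN:exon.1 1 0
chr1:955542-955763 AGRN:exon.1 2 0
chr1:955542-955763 AGRN:exon.1 3 0
chr1:955542-955763 AGRN:exon.1 4 1
chr1:955542-955763 AGRN:exon.1 5 1
chr1:955542-955763 AGRN:exon.1 6 1
chr1:955542-955763 AGRN:exon.1 7 1
chr1:955542-955763 AGRN:exon.1 8 1
chr1:955542-955763 AGRN:exon.1 9 1
chr1:955542-955763 AGRN:exon.1 10 1
chr1:955542-955763 AGRN:exon.1 11 32
EOF
)
printf '%s\n' "$result"
```

On the sample data this prints the same Lo and Hi lines as the `gawk` version. With more than one target, the order of targets follows `for (X in seen)`, which `awk` leaves unspecified.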
Original answer
I think you want something like this, which ignores lines where the last field is 30 or greater:
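awk '
$4 < 30 {
    N[$1]++        # bases with fewer than 30 reads
    T[$1] += $4    # total reads over those bases
    B[$1] = $3     # position of the last such base, used as the divisor
    M[$1] = $2     # target name
}
END {
    for (X in N)
        printf("%s is %d bases and maps to %s with an average depth of %f reads\n",
               X, N[X], M[X], T[X] / B[X])
}' input.txt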
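To get the single line in the desired format, the overall totals and the below-threshold totals can be kept side by side in one pass. The following is a sketch under a few assumptions: the array names `LN` and `LT`, the `limit` variable, and the inline sample data (copied from the question's input) are introduced here for illustration only.

```shell
# One pass: overall average plus count and average of bases under the limit.
result=$(awk -v limit=30 '
{
    N[$1]++                 # all bases for this target
    T[$1] += $4             # all reads for this target
    M[$1] = $2              # target name
    if ($4 < limit) {
        LN[$1]++            # bases under the limit
        LT[$1] += $4        # reads over those bases
    }
}
END {
    fmt = "%s is %d bases and maps to %s with an average depth of %.2f reads" \
          " and there are %d bases less than %d reads with an average coverage of %.2f reads\n"
    for (X in N) {
        low = (X in LN) ? LT[X] / LN[X] : 0   # avoid dividing by zero
        printf fmt, X, N[X], M[X], T[X] / N[X], LN[X], limit, low
    }
}' <<'EOF'
chr1:955542-955763 AGRN:exon.1 1 0
chr1:955542-955763 AGRN:exon.1 2 0
chr1:955542-955763 AGRN:exon.1 3 0
chr1:955542-955763 AGRN:exon.1 4 1
chr1:955542-955763 AGRN:exon.1 5 1
chr1:955542-955763 AGRN:exon.1 6 1
chr1:955542-955763 AGRN:exon.1 7 1
chr1:955542-955763 AGRN:exon.1 8 1
chr1:955542-955763 AGRN:exon.1 9 1
chr1:955542-955763 AGRN:exon.1 10 1
chr1:955542-955763 AGRN:exon.1 11 32
EOF
)
printf '%s\n' "$result"
```

On the sample data this reports an average depth of 3.55 (because `%.2f` rounds 39 / 11 rather than truncating it to 3.54) and an average coverage of 0.70, which is 7 reads over the 10 bases under the limit. The 0.63 in the desired output looks like 7 / 11 truncated; if that is the intended figure, dividing `LT[X]` by `N[X]` instead of `LN[X]` would be the change to make.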