Awk: output a file according to specific rules


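I am trying to use the awk below to produce output that looks like the desired output, and I am having some trouble with the syntax. The part I seem to be stuck on is this: for each specific target in $1, count the bases ($3) that have fewer than 30 reads, output that number, and calculate their average. Thank you.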

awk '
    {N[$1]++
     T[$1]+=$4
     M[$1]=$2
    }
END     {for (X in N) printf ("%s is %d bases and maps to %s with an average depth"
                            " of %f reads\n", X, N[X], M[X], T[X]/N[X]);
    }
'  input.txt > output.txt
Input

chr1:955542-955763  AGRN:exon.1 1   0
chr1:955542-955763  AGRN:exon.1 2   0
chr1:955542-955763  AGRN:exon.1 3   0
chr1:955542-955763  AGRN:exon.1 4   1
chr1:955542-955763  AGRN:exon.1 5   1
chr1:955542-955763  AGRN:exon.1 6   1
chr1:955542-955763  AGRN:exon.1 7   1
chr1:955542-955763  AGRN:exon.1 8   1
chr1:955542-955763  AGRN:exon.1 9   1
chr1:955542-955763  AGRN:exon.1 10  1
chr1:955542-955763  AGRN:exon.1 11  32

Current output

chr1:955542-955763 is 11 bases and maps to AGRN:exon.1 with an average depth of 3.545455 reads

Desired output

chr1:955542-955763 is 11 bases and maps to AGRN:exon.1 with an average depth of 3.54 reads and there are 10 bases less than 30 reads with an average coverage of 0.63 reads

Edit (field descriptions)

awk '{for (i=1; i<=NF; i++) print i, $i}' input.txt
1 chr1:955542-955763 (defines the specific target location) - variable N
2 AGRN:exon.1  (defines the name/id of the target location) - variable M
3 1   (defines the exact base on the target)
4 0    (used to calculate the average) - variable T

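The first part of the output seems to work perfectly; I am only trying to add to it to get the second part, that is: and there are 10 bases less than 30 reads with an average coverage of 0.63 reads

Here 10 is the number of bases whose read count in the last field is below 30, and 0.63 is the average of the $4 values for those bases. I hope this helps, thank you :).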

Two-dimensional output

Lo: chr1:955542-955763 is 10 bases and maps to AGRN:exon.1 with an average depth of 0.700000 reads
Hi: chr1:955542-955763 is 1 bases and maps to AGRN:exon.1 with an average depth of **2.909091** reads  ( should be 32 - `$4` is 32 / 1)

Updated answer

For output split at the threshold, I would fall back on GNU awk for its two-dimensional arrays:

gawk '
    {  i=1                 # use second index of 1 for $4 < 30
       if($4>=30)i=2       # use second index of 2 for $4 >= 30
       N[$1][i]++
       T[$1][i]+=$4
       B[$1][i]++
       M[$1][i]=$2
    }
    END {
       for (X in N){
          printf ("Lo: %s is %d bases and maps to %s with an average depth"
                            " of %f reads\n", X, N[X][1], M[X][1], T[X][1]/B[X][1]);
          printf ("Hi: %s is %d bases and maps to %s with an average depth"
                            " of %f reads\n", X, N[X][2], M[X][2], T[X][2]/B[X][2]);
       }
    }    ' input.txt

Lo: chr1:955542-955763 is 10 bases and maps to AGRN:exon.1 with an average depth of 0.700000 reads
Hi: chr1:955542-955763 is 1 bases and maps to AGRN:exon.1 with an average depth of 32.000000 reads
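
The same Lo/Hi split can also be sketched without gawk's arrays of arrays, using the comma-subscript form that any POSIX awk accepts. This is only a sketch under a few assumptions: the function name split_depth, the variable limit, and the helper arrays seen and label are illustrative and do not come from the answer above. It also skips a bucket that has no bases, which avoids the division by zero the gawk version would hit for a target whose bases all fall on one side of the threshold.

```shell
# split_depth is a hypothetical helper name; limit is an assumed variable
# holding the threshold from the question (30).
split_depth() {
    awk -v limit=30 '
    {
        i = ($4 >= limit) ? 2 : 1   # 1 = below the threshold, 2 = at or above it
        seen[$1]                    # remember each target once
        N[$1, i]++                  # bases in this bucket
        T[$1, i] += $4              # summed depth in this bucket
        M[$1] = $2                  # target name/id
    }
    END {
        label[1] = "Lo"; label[2] = "Hi"
        for (X in seen)
            for (i = 1; i <= 2; i++)
                if (N[X, i] > 0)    # skip empty buckets to avoid dividing by zero
                    printf "%s: %s is %d bases and maps to %s with an average depth of %f reads\n", label[i], X, N[X, i], M[X], T[X, i] / N[X, i]
    }' "$@"
}
```

Called as split_depth input.txt, it should print the same two lines as the gawk version for the sample input.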

Original answer

I think you want something like this, which ignores lines whose last field is 30 or greater:

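awk '
    $4 < 30 {
       N[$1]++           # number of bases below 30 reads for this target
       T[$1]+=$4         # summed depth of those bases
       M[$1]=$2          # target name/id
    }
    END {
       # divide by the count N[X], not by the last base position in $3
       for (X in N) printf ("%s is %d bases and maps to %s with an average depth"
                            " of %f reads\n", X, N[X], M[X], T[X]/N[X]);
    } ' input.txt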
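
Neither script above prints the single combined line shown under the desired output, so here is a minimal sketch of one way it might be produced in a single pass. The function name depth_summary, the variable limit, and the array names are assumptions made for illustration, and the %.2f format is a guess at the intended two-decimal style.

```shell
# depth_summary is a hypothetical helper name; limit is an assumed variable
# holding the threshold from the question (30).
depth_summary() {
    awk -v limit=30 '
    {
        n[$1]++                # bases seen for this target
        total[$1] += $4        # summed depth over all bases
        name[$1] = $2          # target name/id
        if ($4 < limit) {      # bases below the threshold
            low_n[$1]++
            low_total[$1] += $4
        }
    }
    END {
        for (t in n) {
            # guard against targets with no base below the threshold
            low_avg = low_n[t] ? low_total[t] / low_n[t] : 0
            printf "%s is %d bases and maps to %s with an average depth of %.2f reads and there are %d bases less than %d reads with an average coverage of %.2f reads\n", t, n[t], name[t], total[t] / n[t], low_n[t], limit, low_avg
        }
    }' "$@"
}
```

Run as depth_summary input.txt > output.txt. Note that %.2f rounds rather than truncates, so on the sample input the overall average prints as 3.55 instead of 3.54, and the low-coverage average prints as 0.70 (7 reads over 10 bases). The 0.63 in the desired output appears to correspond to 7 divided by all 11 bases.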
