awk script shows incorrect output



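I'm having a problem with an awk script: I need to generate a report containing the lowest, highest, and average score for each assignment in the data file. The assignment name is in column 3.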

The input data is:

Student,Catehory,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Chelsey,Homework,H02,89,100
Chelsey,Homework,H03,77,100
Chelsey,Homework,H04,80,100
Chelsey,Homework,H05,82,100
Chelsey,Homework,H06,84,100
Chelsey,Homework,H07,86,100
Chelsey,Lab,L01,91,100
Chelsey,Lab,L02,100,100
Chelsey,Lab,L03,100,100
Chelsey,Lab,L04,100,100
Chelsey,Lab,L05,96,100
Chelsey,Lab,L06,80,100
Chelsey,Lab,L07,81,100
Chelsey,Quiz,Q01,100,100
Chelsey,Quiz,Q02,100,100
Chelsey,Quiz,Q03,98,100
Chelsey,Quiz,Q04,93,100
Chelsey,Quiz,Q05,99,100
Chelsey,Quiz,Q06,88,100
Chelsey,Quiz,Q07,100,100
Chelsey,Final,FINAL,82,100
Chelsey,Survey,WS,5,5
Sam,Homework,H01,19,100
Sam,Homework,H02,82,100
Sam,Homework,H03,95,100
Sam,Homework,H04,46,100
Sam,Homework,H05,82,100
Sam,Homework,H06,97,100
Sam,Homework,H07,52,100
Sam,Lab,L01,41,100
Sam,Lab,L02,85,100
Sam,Lab,L03,99,100
Sam,Lab,L04,99,100
Sam,Lab,L05,0,100
Sam,Lab,L06,0,100
Sam,Lab,L07,0,100
Sam,Quiz,Q01,91,100
Sam,Quiz,Q02,85,100
Sam,Quiz,Q03,33,100
Sam,Quiz,Q04,64,100
Sam,Quiz,Q05,54,100
Sam,Quiz,Q06,95,100
Sam,Quiz,Q07,68,100
Sam,Final,FINAL,58,100
Sam,Survey,WS,5,5
Andrew,Homework,H01,25,100
Andrew,Homework,H02,47,100
Andrew,Homework,H03,85,100
Andrew,Homework,H04,65,100
Andrew,Homework,H05,54,100
Andrew,Homework,H06,58,100
Andrew,Homework,H07,52,100
Andrew,Lab,L01,87,100
Andrew,Lab,L02,45,100
Andrew,Lab,L03,92,100
Andrew,Lab,L04,48,100
Andrew,Lab,L05,42,100
Andrew,Lab,L06,99,100
Andrew,Lab,L07,86,100
Andrew,Quiz,Q01,25,100
Andrew,Quiz,Q02,84,100
Andrew,Quiz,Q03,59,100
Andrew,Quiz,Q04,93,100
Andrew,Quiz,Q05,85,100
Andrew,Quiz,Q06,94,100
Andrew,Quiz,Q07,58,100
Andrew,Final,FINAL,99,100
Andrew,Survey,WS,5,5
Ava,Homework,H01,55,100
Ava,Homework,H02,95,100
Ava,Homework,H03,84,100
Ava,Homework,H04,74,100
Ava,Homework,H05,95,100
Ava,Homework,H06,84,100
Ava,Homework,H07,55,100
Ava,Lab,L01,66,100
Ava,Lab,L02,77,100
Ava,Lab,L03,88,100
Ava,Lab,L04,99,100
Ava,Lab,L05,55,100
Ava,Lab,L06,66,100
Ava,Lab,L07,77,100
Ava,Quiz,Q01,88,100
Ava,Quiz,Q02,99,100
Ava,Quiz,Q03,44,100
Ava,Quiz,Q04,55,100
Ava,Quiz,Q05,66,100
Ava,Quiz,Q06,77,100
Ava,Quiz,Q07,88,100
Ava,Final,FINAL,99,100
Ava,Survey,WS,5,5
Shane,Homework,H01,50,100
Shane,Homework,H02,60,100
Shane,Homework,H03,70,100
Shane,Homework,H04,60,100
Shane,Homework,H05,70,100
Shane,Homework,H06,80,100
Shane,Homework,H07,90,100
Shane,Lab,L01,90,100
Shane,Lab,L02,0,100
Shane,Lab,L03,100,100
Shane,Lab,L04,50,100
Shane,Lab,L05,40,100
Shane,Lab,L06,60,100
Shane,Lab,L07,80,100
Shane,Quiz,Q01,70,100
Shane,Quiz,Q02,90,100
Shane,Quiz,Q03,100,100
Shane,Quiz,Q04,100,100
Shane,Quiz,Q05,80,100
Shane,Quiz,Q06,80,100
Shane,Quiz,Q07,80,100
Shane,Final,FINAL,90,100
Shane,Survey,WS,5,5

The awk script:

BEGIN {
    FS=" *\, *"
}
FNR>1 {
    min[$3]=(!($3 in min) || min[$3]> $4 )? $4 : min[$3]
    max[$3]=(max[$3]> $4)? max[$3] : $4
    cnt[$3]++
    sum[$3]+=$4
}
END {
    print "Name\tLow\tHigh\tAverage"
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}

Expected sample output:

Name    Low     High    Average
Q06     77      95      86.80
L05     40      96      46.60
WS      5       5       5
Q07     58      100     78.80
L06     60      99      61
L07     77      86      64.80

When I run the script, I get a "Low" of 0 for every assignment, which is incorrect. Where am I going wrong? Please advise.

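You can certainly do this with awk, but since the question is also tagged as scripting, I assume other tools are an option. For this kind of per-group statistics gathering, GNU datamash often reduces the job to a simple one-liner. For example: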

$ (echo Name,Low,High,Average; datamash --header-in -s -t, -g3 min 4 max 4 mean 4  < input.csv) | tr , '\t'
Name    Low     High    Average
FINAL   58      99      85.6
H01     19      90      47.8
H02     47      95      74.6
H03     70      95      82.2
H04     46      80      65
H05     54      95      76.6
H06     58      97      80.6
H07     52      90      67
L01     41      91      75
L02     0       100     61.4
L03     88      100     95.8
L04     48      100     79.2
L05     0       96      46.6
L06     0       99      61
L07     0       86      64.8
Q01     25      100     74.8
Q02     84      100     91.6
Q03     33      100     66.8
Q04     55      100     81
Q05     54      99      76.8
Q06     77      95      86.8
Q07     58      100     78.8
WS      5       5       5

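This says: for each group of rows with the same value in column 3 (-g3, plus -s to sort the input, which the tool requires) of simple CSV input (-t,) with a header (--header-in), show the minimum, maximum, and mean of column 4. All of that is given a new header and piped to tr to turn the commas into tabs.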
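
The sort step matters because, as far as I understand it, this kind of grouping folds runs of adjacent rows that share a key. Below is a rough sketch of the same idea using only sort and awk. It is not how datamash is implemented; the function name group_stats is made up for illustration, and the sample rows are a handful taken from the question's data.

```shell
# A few rows from the question, deliberately not grouped by assignment.
sample='Student,Category,Assignment,Score,Possible
Chelsey,Homework,H01,90,100
Sam,Lab,L05,0,100
Sam,Homework,H01,19,100
Chelsey,Lab,L05,96,100
Andrew,Homework,H01,25,100'

# group_stats (hypothetical name): sort data rows by column 3, then fold
# each run of equal keys into "key,min,max,mean".
group_stats() {
  tail -n +2 |
  sort -t, -k3,3 |
  awk -F, '
    # Print the finished group, if one has been started.
    function emit() { if (n) printf "%s,%d,%d,%.1f\n", key, lo, hi, total / n }
    # A new key starts a new run: flush the old one and reset the state.
    $3 != key { emit(); key = $3; lo = hi = $4 + 0; total = 0; n = 0 }
    {
      if ($4 + 0 < lo) lo = $4 + 0
      if ($4 + 0 > hi) hi = $4 + 0
      total += $4; n++
    }
    END { emit() }
  '
}

printf '%s\n' "$sample" | group_stats
```

Because only one group is held in memory at a time, this shape only works when equal keys are adjacent, which is why the rows are sorted first.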

Your code works as-is in GNU awk. However, running it with the -t (--lint-old) option, which warns about non-portable constructs, gives:

awk: foo.awk:6: warning: old awk does not support the keyword `in' except after `for'
awk: foo.awk:2: warning: old awk does not support regexps as value of `FS'

Running the script with a different awk implementation (mawk, in my case) does give 0 for the Low column. So, a few tweaks to the script:

BEGIN {
    FS=","
}
FNR>1 {
    min[$3]=(cnt[$3] == 0 || min[$3]> $4 )? $4 : min[$3]
    max[$3]=(max[$3]> $4)? max[$3] : $4
    cnt[$3]++
    sum[$3]+=$4
}
END {
    print "Name\tLow\tHigh\tAverage"
    PROCINFO["sorted_in"] = "@ind_str_asc" # gawk-ism for pretty output; ignored on other awks
    for (i in cnt)
        printf("%s\t%d\t%d\t%.1f\n", i, min[i], max[i], sum[i]/cnt[i])
}

It works as expected on the other awk too.

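Changes:

  • Use a plain comma as the field separator instead of a regular expression.
  • Change the min condition so that min is set to the current value the first time an assignment is seen. It now checks whether cnt[$3] is 0 (which it is the first time, because that value is only incremented on a later line) or whether the current minimum is greater than this value.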
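
A likely reason the original test failed on mawk, though this is my reading rather than something confirmed above: in min[$3] = (!($3 in min) ...) an implementation may touch the left-hand min[$3] before evaluating the right-hand side, so the element already exists when the `in` test runs and the empty value later prints as 0. Testing a different array avoids the question entirely. Here is a minimal sketch of that first-seen check written as an explicit if/else; the function name minmax_report is hypothetical and the three rows are taken from the question's data.

```shell
# minmax_report (hypothetical name): per-assignment min, max and mean,
# seeding min/max from the first score seen for each assignment.
minmax_report() {
  awk -F, '
    NR > 1 {
      # cnt[$3] is still unset (compares equal to 0) the first time an
      # assignment appears, so the first score seeds both min and max.
      if (cnt[$3] == 0) {
        min[$3] = max[$3] = $4
      } else {
        if ($4 + 0 < min[$3] + 0) min[$3] = $4
        if ($4 + 0 > max[$3] + 0) max[$3] = $4
      }
      cnt[$3]++
      sum[$3] += $4
    }
    END {
      for (k in cnt)
        printf "%s\t%d\t%d\t%.1f\n", k, min[k], max[k], sum[k] / cnt[k]
    }
  '
}

printf '%s\n' \
  'Student,Category,Assignment,Score,Possible' \
  'Chelsey,Lab,L01,91,100' \
  'Sam,Lab,L01,41,100' \
  'Andrew,Lab,L01,87,100' | minmax_report
```

The `+ 0` in the comparisons forces a numeric comparison, so scores such as 100 and 91 are never compared as strings.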

Another similar approach:

$ awk -F, 'NR==1 {print "name","low","high","average"; next} 
{k=$3; sum[k]+=$4; count[k]++}
!(k in min) {min[k]=max[k]=$4} 
min[k]>$4 {min[k]=$4} 
max[k]<$4 {max[k]=$4}                    
END       {for(k in min) print k,min[k],max[k],sum[k]/count[k]}' file | 
column -t
name   low  high  average
Q06    77   95    86.8
L05    0    96    46.6
WS     5    5     5
Q07    58   100   78.8
L06    0    99    61
L07    0    86    64.8
H01    19   90    47.8
H02    47   95    74.6
H03    70   95    82.2
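
The rows above come out in whatever order `for (k in min)` happens to visit the keys, which awk does not define. If a stable order is wanted without the gawk-only PROCINFO setting, one option is to print the header from the shell and pipe only the data rows through sort. This is a sketch under that assumption; sorted_report is a made-up name and the sample rows are taken from the question's data.

```shell
# sorted_report (hypothetical name): same per-assignment stats as above,
# with the data rows sorted by assignment name.
sorted_report() {
  echo "name low high average"
  awk -F, '
    NR == 1 { next }                         # skip the CSV header
    { k = $3; sum[k] += $4; count[k]++ }
    !(k in min) { min[k] = max[k] = $4 }     # first score for this key
    min[k] + 0 > $4 + 0 { min[k] = $4 }
    max[k] + 0 < $4 + 0 { max[k] = $4 }
    END { for (k in min) print k, min[k], max[k], sum[k] / count[k] }
  ' | sort
}

printf '%s\n' \
  'Student,Category,Assignment,Score,Possible' \
  'Sam,Homework,H02,82,100' \
  'Chelsey,Homework,H01,90,100' \
  'Chelsey,Homework,H02,89,100' \
  'Sam,Homework,H01,19,100' | sorted_report
```

Appending `| column -t` to the call lines the columns up as in the output shown earlier.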
