AWK:基于两列的数据后处理



我正在按以下顺序处理以多列格式排列的CSV日志的后处理:第一列对应行号(ID(,第二列包含其总体(POP,属于该ID的样本数(,第三列(dG(表示该ID的一些固有值(始终为负(:

ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500
27, 1, -6.6500
28, 5, -6.5700
29, 3, -6.5500
30, 2, -6.4600
31, 2, -6.4500
32, 1, -6.3000
33, 7, -6.2900
34, 1, -6.2100
35, 1, -6.2000
36, 3, -6.1800
37, 1, -6.1700
38, 4, -6.1300
39, 1, -6.1000
40, 2, -6.0600
41, 3, -6.0600
42, 8, -6.0200
43, 2, -6.0100
44, 1, -6.0100
45, 1, -5.9800
46, 2, -5.9700
47, 1, -5.9300
48, 6, -5.8800
49, 4, -5.8300
50, 4, -5.8000
51, 2, -5.7800
52, 3, -5.7200
53, 1, -5.6600
54, 1, -5.6500
55, 4, -5.6400
56, 2, -5.6300
57, 1, -5.5700
58, 1, -5.5600
59, 1, -5.5200
60, 1, -5.5000
61, 3, -5.4200
62, 4, -5.3600
63, 1, -5.3100
64, 5, -5.2500
65, 5, -5.1600
66, 1, -5.1100
67, 1, -5.0300
68, 1, -4.9700
69, 1, -4.7700
70, 2, -4.6600

为了减少行数,我过滤了这个CSV,目的是使用以下AWK表达式搜索第二列(POP(中编号最高的行:

# search CSV for the line with the highest POP and save all lines before it, while keeping minimal number of the lines (3) in the case if this line is found at the beginning of CSV.
awk -v min_lines=3 -F ", " 'a < $2 {for(idx=0; idx < i; idx++) {print arr[idx]} print $0; a=int($2); i=0; printed=NR} a > $2 && NR > 1 {arr[i]=$0; i++}END{if(printed <= min_lines) {for(idx = 0; idx <= min_lines - printed; idx++){print arr[idx]}}}' input.csv > output.csv

从而获得以下减少的输出CSV,由于搜索字符串(具有最高POP(位于第26行,该CSV仍有许多行:

ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000
11, 1, -7.5600
12, 2, -7.5200
13, 9, -7.5100
14, 1, -7.5000
15, 2, -7.4200
16, 1, -7.3300
17, 1, -7.1700
18, 4, -7.1300
19, 3, -6.9200
20, 1, -6.9200
21, 2, -6.9100
22, 2, -6.8500
23, 10, -6.6900
24, 2, -6.6800
25, 1, -6.6600
26, 20, -6.6500

如何通过修改我的AWK表达式(或将其管道传输到其他内容(来进一步自定义我的过滤器,以便额外考虑第三列的负值dG与第一行(其值最负(相比差异较小的行?例如,只考虑与第一条线相比dG不超过20%的线,同时保持所有休息条件相同:

ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000

这两项任务都可以在单个awk:中完成

awk -F ', ' 'NR==1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR}; if (FNR==2) dg = $3 * .8; next} $3+0 == $3 && (FNR == n+1 || $3 > dg) {exit} 1' file file
ID, POP, dG
1, 7, -9.6000
2, 3, -8.7700
3, 6, -8.6200
4, 4, -8.2700
5, 6, -8.0800
6, 10, -8.0100
7, 9, -7.9700
8, 8, -7.8400
9, 16, -7.8100
10, 2, -7.7000

使其可读性更强:

awk -F ', ' '
NR == 1 {
next
}
FNR == NR {
if (max < $2) {
max = $2
n = FNR
}
if (FNR == 2)
dg = $3 * .8
next
}
$3 + 0 == $3 && (FNR == n+1 || $3 > dg) {
exit
}
1' file file