AWK: filtering data based on information from TWO columns



I am post-processing a multi-column CSV arranged in the following format:

ID, POP, dG
1, 10, -5.6200
2, 4, -5.4900
3, 1, -5.3000
4, 4, -5.1600
5, 4, -4.8800
6, 3, -4.7600
7, 2, -4.4900
8, 5, -4.4500
9, 2, -4.4400
10, 8, -4.1400
11, 1, -4.1200
12, 2, -4.0900
13, 5, -4.0100
14, 1, -3.9500
15, 3, -3.9200
16, 10, -3.8800
17, 1, -3.8700
18, 3, -3.8300
19, 1, -3.8200
20, 3, -3.8000

Previously I used the following AWK solution, which reads the input log twice, detects pop(MAX), and keeps the lines matching $2 > (.8 * max):

awk -F ', ' 'NR == 1 {next} FNR==NR {if (max < $2) {max=$2; n=FNR+1} next} FNR <= 2 || (FNR == n && $2 > (.4*max)) || $2 > (.8 * max)' input.csv{,} > output.csv

This reduces the input log to only the two lines with the highest POP:

ID, POP, dG
1, 10, -5.6200
16, 10, -3.8800

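Now I need to change the search algorithm so that it considers both the second column (POP) and the third column (dG):
(i) always take the first line as the reference, since it always has the most negative number in the third column;
(ii) find the line with the largest number in the second column, pop(MAX);
(iii) take all lines between (i) and (ii) that match the following rules, applied to BOTH columns:
a) the value in the third column should be negative and match the rule $3 <= (.5 * $3(min)), where $3(min) is the dG of the first line (always the most negative);
b) in addition, the line should match the old rule for the second column with a lowered threshold: $2 >= (.5 * max), where max is pop(MAX).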

So the expected output should be:

ID, POP, dG
1, 10, -5.6200   # this is the first line, with the most negative dG
8, 5, -4.4500    # POP (5) and dG (-4.4500) match both rules
10, 8, -4.1400   # POP (8) and dG (-4.1400) match both rules
16, 10, -3.8800  # this is pop(MAX), the line with the highest POP
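
To make the two cut-offs concrete, here is a minimal sketch that only computes them from the file, assuming a single header line and ', ' as the field separator. The function name thresholds is illustrative and not part of the original question.

```shell
# Hypothetical helper: print the two cut-offs implied by rules (a) and (b).
# Assumes one header line and fields separated by comma plus space.
thresholds() {
    awk -F ', ' '
        NR == 1 { next }                               # skip the header
        maxP == "" || $2 + 0 > maxP { maxP = $2 + 0 }  # track max(POP)
        minD == "" || $3 + 0 < minD { minD = $3 + 0 }  # track min(dG), most negative
        END { printf "POP >= %.2f, dG <= %.4f\n", .5 * maxP, .5 * minD }
    ' "$1"
}
```

For the first sample above (max POP 10, most negative dG -5.6200), calling thresholds input.csv prints POP >= 5.00, dG <= -2.8100.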

Added 8-04:

For the case where the first line has a very low POP (not matching the rule $2 >= (.5 * maxPop)):

ID, POP, dG
1, 5, -5.5600
2, 7, -5.3300
3, 7, -5.1900
4, 1, -4.6800
5, 1, -4.5800
6, 5, -4.5600
7, 3, -4.4700
8, 4, -4.4300
9, 9, -4.4200
10, 4, -4.4200
11, 2, -4.3800
12, 4, -4.3400
13, 25, -4.3000
14, 6, -4.2900
15, 8, -4.2600
16, 3, -4.2300
17, 1, -4.1800
18, 3, -4.1300
19, 1, -4.1300
20, 1, -4.1200
21, 27, -4.0800
22, 2, -4.0300

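the output should not contain the first line either, while its dG value is still used as the reference for the second condition ($3 <= (.5 * minD)), which is applied to select the other lines in the output: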

13, 25, -4.3000
21, 27, -4.0800

You may use this awk solution:

awk -F ', ' 'NR == 1 {next} FNR==NR {if (maxP < $2) maxP=$2; if (minD=="" || minD > $3) minD=$3; next} FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))' file{,}
ID, POP, dG
1, 10, -5.6200
8, 5, -4.4500
10, 8, -4.1400
13, 5, -4.0100
16, 10, -3.8800
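
The file{,} argument relies on shell brace expansion (available in bash, among others) to pass the same file name twice, so FNR == NR is true only while the first copy is being read. Below is a minimal sketch of that two-pass idiom with the file name written out twice, so that it does not depend on brace expansion. The function name two_pass_demo is illustrative, and the sketch assumes POP values are positive.

```shell
# Hypothetical demo of the two-pass idiom: pass 1 finds max(POP),
# pass 2 prints every row next to that maximum.
two_pass_demo() {
    awk -F ', ' '
        FNR == 1  { next }                                   # header, in either pass
        FNR == NR { if ($2 + 0 > max) max = $2 + 0; next }   # pass 1 only
        { print $1 ": POP " $2 " of max " max }              # pass 2: max is known
    ' "$1" "$1"
}
```

Reading the file twice keeps memory use flat, at the cost of a second read; storing the rows in an array during a single pass would be the usual alternative.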

To make it more readable:

awk -F ', ' '
NR == 1 {next}                      # skip 1st record 1st time
FNR == NR {
   if (maxP < $2)                   # compute max(POP)
      maxP = $2
   if (minD == "" || minD > $3)     # compute min(dG)
      minD = $3
   next
}
# print if 1st 2 lines OR "$2 >= .5 * max(POP) && $3 <= .5 * min(dG)"
FNR <= 2 || ($2 >= (.5 * maxP) && $3 <= (.5 * minD))
' file{,}
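
For the added case where the first data row itself has a low POP, the FNR <= 2 clause above still prints that row unconditionally. One possible adjustment, offered as a sketch rather than as the answer's own method, is to force only the header (FNR == 1) and let the first row go through the same two tests as every other row. The wrapper name filter_rows is hypothetical.

```shell
# Hypothetical variant: only the header is printed unconditionally;
# the first data row must pass both thresholds like any other row.
filter_rows() {
    awk -F ', ' '
        NR == 1 { next }                                    # header, first pass
        FNR == NR {                                         # first pass: collect extremes
            if (maxP == "" || $2 + 0 > maxP) maxP = $2 + 0  # max(POP)
            if (minD == "" || $3 + 0 < minD) minD = $3 + 0  # min(dG)
            next
        }
        # second pass: keep the header and rows passing both thresholds
        FNR == 1 || ($2 >= .5 * maxP && $3 <= .5 * minD)
    ' "$1" "$1"
}
```

On the second sample from the question this keeps the header plus rows 13 and 21; on the first sample it keeps the header plus rows 1, 8, 10, 13 and 16, the same rows as the command above.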
