Counting multiple patterns in a file with AWK



I'm trying to use awk to parse IPs and their occurrence counts from a log file, and I'd like the output to look like this:

IP1 occurrences
IP2 occurrences

The log I have looks like this:

2023-03-30 07:14:32.494 INFO  [asda/1737.140291506677312] cmd <IP> has been allocated

I tried the following code:

awk '/100.64./{for(i=1;i<=NF;++i)if($i~/100.64./){a[$i]++} {for (x in a) print x,a[x]}}'

But I get a mixed list containing non-unique IPs and many different counts. What is wrong with the code?

Thanks in advance,

Example:

2023-03-30 10:39:31.214 INFO  [kea-dhcp4.leases/1737.140291531855424] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.147.36 has been allocated for 43200 seconds
2023-03-30 10:39:31.598 INFO  [kea-dhcp4.leases/1737.140291506677312] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.146.13 has been allocated for 43200 seconds
2023-03-30 10:39:31.745 INFO  [kea-dhcp4.leases/1737.140291523462720] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.146.4 has been allocated for 43200 seconds
2023-03-30 10:39:32.396 INFO  [kea-dhcp4.leases/1737.140291515070016] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.147.17 has been allocated for 43200 seconds
2023-03-30 10:39:32.466 INFO  [kea-dhcp4.leases/1737.140291531855424] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.144.122 has been allocated for 43200 seconds
2023-03-30 10:39:33.079 INFO  [kea-dhcp4.leases/1737.140291506677312] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.144.161 has been allocated for 43200 seconds
2023-03-30 10:39:33.220 INFO  [kea-dhcp4.leases/1737.140291523462720] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.144.77 has been allocated for 43200 seconds
2023-03-30 10:39:33.407 INFO  [kea-dhcp4.leases/1737.140291515070016] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.147.66 has been allocated for 43200 seconds
2023-03-30 10:39:33.427 INFO  [kea-dhcp4.leases/1737.140291531855424] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.149.201 has been allocated for 43200 seconds
2023-03-30 10:39:33.839 INFO  [kea-dhcp4.leases/1737.140291506677312] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.146.45 has been allocated for 43200 seconds
2023-03-30 10:39:34.530 INFO  [kea-dhcp4.leases/1737.140291523462720] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.144.47 has been allocated for 43200 seconds
2023-03-30 10:39:35.098 INFO  [kea-dhcp4.leases/1737.140291515070016] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.144.30 has been allocated for 43200 seconds
2023-03-30 10:39:35.249 INFO  [kea-dhcp4.leases/1737.140291531855424] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.147.54 has been allocated for 43200 seconds
2023-03-30 10:39:36.081 INFO  [kea-dhcp4.leases/1737.140291506677312] DHCP4_LEASE_ALLOC [hwtype=1<REDACTED>: lease 100.64.144.77 has been allocated for 43200 seconds

Result:

100.64.147.36 1
100.64.147.36 1
100.64.146.13 1
100.64.147.36 1
100.64.146.13 1
100.64.146.4 1
100.64.147.36 1
...
...
...
100.64.147.36 1
100.64.144.30 1
100.64.144.161 1
100.64.147.66 1
100.64.146.13 1
100.64.149.201 1
100.64.146.4 1
100.64.144.47 1
100.64.146.45 1
100.64.144.122 1
100.64.144.77 2
100.64.147.17 1

Expected result:

100.64.144.122 1
100.64.144.161 1
100.64.144.30 1
100.64.144.47 1
100.64.144.77 2
100.64.146.13 1
100.64.146.4 1
100.64.146.45 1
100.64.147.17 1
100.64.147.36 1
100.64.147.54 1
100.64.147.66 1
100.64.149.201 1

$ awk '{cnt[$8]++} END{for (i in cnt) print i, cnt[i]}' file
100.64.147.54 1
100.64.147.36 1
100.64.144.30 1
100.64.144.161 1
100.64.147.66 1
100.64.146.13 1
100.64.149.201 1
100.64.146.4 1
100.64.144.47 1
100.64.146.45 1
100.64.144.122 1
100.64.144.77 2
100.64.147.17 1

If your input does contain IP addresses that don't start with 100.64. and you only want counts for the 100.64.* addresses, then it would be:

awk '$8 ~ /^100\.64\./{cnt[$8]++} END{for (i in cnt) print i, cnt[i]}' file

The main problems with this code:

awk '/100.64./{for(i=1;i<=NF;++i)if($i~/100.64./){a[$i]++} {for (x in a) print x,a[x]}}'

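are:

  1. You're looping over every field in the line and testing $i~/100.64./ when, given the example you provided, you only want to match on field 8, and
  2. The .s in 100.64. aren't escaped with \, so each . matches any character, which means the regexp would also match inside "[kea-dhcp4.leases/1737.100264506677312]" or similar, and
  3. You aren't anchoring the regexp at the front with ^, so 100.64. would match 10.100.64.17 or similar, and
  4. You're printing the array contents every time a line of input is read instead of once, in the END section, after all input has been read.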
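
To make points 2 to 4 concrete, here is a minimal sketch that runs a loose and a strict variant on three made-up lines. The sample data and the variable names (sample, loose, strict) are invented for illustration, and the IP sits in field 4 of this toy data rather than field 8.

```shell
#!/bin/sh
# Hypothetical sample lines, invented for illustration (not from the real log).
sample='x [kea/1737.100264506677312] lease 100.64.1.1 ok
x [kea/1737.1] lease 10.100.64.17 ok
x [kea/1737.1] lease 100.64.1.1 ok'

# Loose: unescaped, unanchored pattern tested against every field.
# Printing is already moved to END so that only the matching differs.
loose=$(printf '%s\n' "$sample" |
  awk '{for (i = 1; i <= NF; ++i) if ($i ~ /100.64./) a[$i]++}
       END {for (x in a) print x, a[x]}')

# Strict: escaped, anchored pattern tested against a single field.
strict=$(printf '%s\n' "$sample" |
  awk '$4 ~ /^100\.64\./ {a[$4]++}
       END {for (x in a) print x, a[x]}')

printf 'loose:\n%s\n\nstrict:\n%s\n' "$loose" "$strict"
```

The loose variant also counts the bracketed process field and 10.100.64.17, while the strict variant reports only 100.64.1.1 with a count of 2.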

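Assumptions:

  • we want to count all IP addresses, regardless of the format of the line
  • not all lines in the log file contain an IP address
  • no line contains more than one IP address

For the proposed code the format of the file doesn't matter, so for demonstration purposes I'm using a reduced/modified data set: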

$ cat ip.log
2023-03-30 10:39:31.214 INFO  [dhcp4] DLA [ht=1<RED>: lease 100.64.147.36 allocated
2023-03-30 10:39:31.598 INFO  [dhcp4] DLA [ht=1<RED>: lease 100.64.146.13 allocated
2023-03-30 10:39:31.745 INFO  [dhcp4] DLA [ht=1<RED>: lease 100.64.146.4 allocated
2023-03-30 10:39:32.396 INFO  [dhcp4] DLA [ht=1<RED RELEASE>: release 100.64.147.36 deallocated
2023-03-30 10:39:32.466 INFO  [dhcp4] DLA [ht=1<RED NOTHING TO SEE>: nothing to see here
2023-03-30 10:39:33.079 ERROR  [dhcp4] DLA [ht=1<RED ERROR>: failed 100.64.144.161

One awk idea:

awk '
{ if ( match($0,/[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/) )   # if we find a dotted-quad string (aka ip address) ...
     counts[substr($0,RSTART,RLENGTH)]++                               # use the ip as an array index and increment the count stored in the array
}
END { for (ip in counts) print ip,counts[ip] }                         # loop through array entries and print ip address and count
' ip.log

This generates:

100.64.147.36 2
100.64.144.161 1
100.64.146.13 1
100.64.146.4 1

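Notes:

  • if OP needs the output sorted, one (simple) solution is to pipe this output to the desired sort command
  • this will also match non-IP strings like 999.999.999.999; if that is a problem, the regex can be modified to reduce these kinds of invalid matches
  • as coded, this will not match IPv6 addresses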
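
The first two notes can be combined in one sketch: reject octets above 255 inside awk, then sort the result numerically per octet. This is only one possible approach; the input lines and the variable names (lines, result, octet, ok) are made up for illustration, and [0-9]+ is used in place of {1,3} so the example does not depend on interval-expression support in older awk builds.

```shell
#!/bin/sh
# Hypothetical input, invented for illustration.
lines='a lease 100.64.147.36 ok
b lease 100.64.9.5 ok
c bogus 999.999.999.999 here
d lease 100.64.147.36 ok
e nothing to see'

result=$(printf '%s\n' "$lines" | awk '
  # Find the first dotted quad on the line, if any.
  match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {
    ip = substr($0, RSTART, RLENGTH)
    n = split(ip, octet, ".")
    ok = (n == 4)
    # Reject anything that is not four octets in the range 0-255.
    for (i = 1; i <= n; i++)
      if (octet[i] + 0 > 255 || length(octet[i]) > 3) ok = 0
    if (ok) counts[ip]++
  }
  END { for (ip in counts) print ip, counts[ip] }
' | sort -t. -k1,1n -k2,2n -k3,3n -k4,4n)   # numeric sort, octet by octet

printf '%s\n' "$result"
```

With the numeric sort keys, 100.64.9.5 is listed before 100.64.147.36, which a plain lexical sort would reverse.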

Use this command:

grep -Po 'lease\s+\K\S+' infile | sort | uniq -c | perl -lane 'print join "\t", @F[1, 0];'

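Here, GNU grep uses the following options:
-P: Use Perl regexes.
-o: Print only the matches (1 match per line), not the entire lines.

\K: Causes the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. Specifically, the preceding part of the regex is ignored when printing the match.

sort: Sort the output of grep (the IP addresses), which is required by uniq.
uniq -c: Count the occurrences of the sorted IP addresses.
perl -lane 'print join "\t", @F[1, 0];': Reverse the order of the fields: print the IP, then the number of occurrences.

The Perl one-liner uses these command line flags:
-e: Tells Perl to look for code in-line, instead of in a file.
-n: Loop over the input one line at a time, assigning it to $_ by default.
-l: Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a: Split $_ into array @F on whitespace or on the regex specified in the -F option.

See also:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start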
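
Because -P and \K are GNU grep features, a rough portable equivalent of the same pipeline may be useful where they are unavailable. The sketch below swaps in sed for the extraction step and awk for the field swap; the sample lines and variable names are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sample lines, invented for illustration.
lines='x lease 100.64.1.1 has been allocated
x nothing to see here
x lease 100.64.1.2 has been allocated
x lease 100.64.1.1 has been allocated'

result=$(printf '%s\n' "$lines" |
  sed -n 's/.*lease  *\([^ ]*\).*/\1/p' |   # keep only the word after "lease"
  sort |                                    # uniq needs sorted input
  uniq -c |                                 # prefix each IP with its count
  awk '{print $2 "\t" $1}')                 # swap to: IP, tab, count

printf '%s\n' "$result"
```

Like the grep version, this matches any word ending in "lease" (for example "release"), so the pattern may need tightening for other log formats.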

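With your shown samples, please try the following GNU awk solution. It uses GNU awk's FPAT variable, set to a regex so that each field contains only the matched value we need.

awk -v FPAT='lease ([0-9]{1,3}\\.){3}[0-9]{1,3} ' '
{
  gsub(/^lease | $/,"",$1)   # strip the leading "lease " and the trailing space
  arr[$1]++                  # count this IP
}
END{
  for(i in arr){
    print i,arr[i]
  }
}
' Input_file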

$ cut -d" " -f9 file|sort|uniq -c|awk '$0=$2" "$1'
100.64.144.122 1
100.64.144.161 1
100.64.144.30 1
100.64.144.47 1
100.64.144.77 2
100.64.146.13 1
100.64.146.4 1
100.64.146.45 1
100.64.147.17 1
100.64.147.36 1
100.64.147.54 1
100.64.147.66 1
100.64.149.201 1

Looking for 100.64.* only:

$ grep -Eo "100\.64\.[0-9]{1,3}\.[0-9]{1,3}" file|sort|uniq -c|awk '$0=$2" "$1'
