Bash script to filter out non-adjacent duplicates in a log

I'm trying to create a script that filters the duplicates out of my logs and keeps the latest instance of each message. A sample is below:

May 29 22:25:19 servername.com Fdm: this is error message 1 error code=0x98765
May 29 22:25:19 servername.com Fdm: this is just a message
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
May 29 22:25:20 servername.com Vpxa: just another message
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:30 servername.com Fdm: another error message 3 76543

The log is split across two files. I've already started a script that merges the two files and sorts them by date with sort -s -r -k1.

I've also managed to make the script ask for the date I want, and then use grep to filter on that date.
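Roughly, the part I have so far looks like this (the file paths here are placeholders, not my real ones):

# Rough sketch of the merge/sort/filter steps described above
cat /var/log/log1 /var/log/log2 > /tmp/merged.log
sort -s -r -k1 /tmp/merged.log -o /tmp/merged.log
read -p "Date to filter on (e.g. 'May 29'): " logdate
grep -e "$logdate" /tmp/merged.log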

Now I just need to find a way to remove the non-adjacent duplicate lines that also have different timestamps. I've tried awk, but my knowledge of it isn't great. Any awk gurus out there who can help?

P.S. One problem I keep running into is that there are identical lines with different error codes that I'd like to remove, but I can only do that with grep -v "constant part of the line". It would be great if there were a way to remove duplicates by percentage of similarity. Also, I can't just have the script ignore certain fields or columns, because the error codes appear in different fields/columns on different lines.

Expected output is as follows:

May 29 22:25:30 servername.com Fdm: another error message 3 76543
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890

I only want the errors, but that's easy enough with grep -i error. The only problem is the duplicate lines with different error codes.

You can do this with sort alone.

Just have it compare fields starting from field 4 when looking for duplicates:

sort -uk4 file.txt

This gives you the first entry from each set of dupes; if you want the last one instead, run the file through tac first:

tac file.txt | sort -uk4 

Example:

$ cat file.txt      
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList
May 21 12:05:02 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 30 07:50:07 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
$ sort -uk4 file.txt
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 29 22:25:19 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList
$ tac file.txt | sort -uk4         
May 30 07:50:07 servername.com Fdm: [FF93DB90 verbose 'Cluster' opID=SWI-56f32f43] Updating inventory manager with 1 datastores
May 21 12:05:02 servername.com Fdm: [FF93DB90 verbose 'Invt' opID=SWI-56f32f43] [InventoryManagerImpl::UpdateDatastoreLockStatus] Lock state change to 4 for datastore /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050
May 29 22:25:19 servername.com Fdm: [FFB03B90 verbose 'Invt' opID=SWI-65391264] [DsStateChange::SaveToInventory] Processing locked error update for /vmfs/volumes/531b5d83-9129a42b-f3f8-001e6849b050 (<unset>) from __localhost__
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaAlarm' opID=SWI-54ad408b] [VpxaAlarm] VM with vmid = 17 not found
May 29 22:25:20 servername.com Vpxa: [FFF3AB90 verbose 'vpxavpxaMoVm' opID=SWI-54ad408b] [VpxaMoVm::CheckMoVm] did not find a VM with ID 17 in the vmList

To remove identical lines that differ only in their timestamps, just check for duplicates in everything after the timestamp, which occupies the first 15 characters (e.g. May 29 22:25:19):

awk '!duplicates[substr($0,16)]++' "$filename"

If your logs were tab-separated, you could be even more precise about which columns to use for duplicate detection; that's a better solution than trying to compute the Levenshtein distance between lines.
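For example, if the message text lived in its own tab-separated column (say the 5th, a hypothetical layout), deduplication could key on just that column:

# Hypothetical: dedupe tab-separated logs on the message column only
awk -F'\t' '!seen[$5]++' file.log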

You can always skip the first 3 fields and remove duplicates with sort -suk4. The first 3 fields make up the date string, so any two lines whose text is identical after that will be deduplicated. You can then re-sort the output however you need:

sort -suk4 filename | sort -rs

Removing lines that differ only in their error code is trickier, but I'd suggest isolating the lines that contain error codes into their own file, then using something like:

sed 's/\(.*error code=\)\([0-9]*\)/\2 \1/' errorfile | sort -suk5 | sed 's/\([0-9]*\) \(.*error code=\)/\2\1/'
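To make the mechanics clearer, here's what the first sed does to one sample line from the question: it moves the numeric error code to the front, so the sort key (field 5 onward) no longer contains it:

$ echo 'May 29 22:25:19 servername.com Fdm: error code=12345 message 2' | sed 's/\(.*error code=\)\([0-9]*\)/\2 \1/'
12345 May 29 22:25:19 servername.com Fdm: error code= message 2

After the swap the code sits in field 1 and the date in fields 2-4, so sort -suk5 compares from the hostname onward; the second sed then moves the code back into place.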

You haven't told us how you define "duplicate", but if you mean messages with the same timestamp, then this would do it:

$ tac file | awk '!seen[$1,$2,$3]++' | tac
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
May 29 22:25:20 servername.com Vpxa: just another message
May 29 22:25:30 servername.com Fdm: another error message 3 76543

If that's not what you mean, then just change the indices used in the awk array to whatever you do want to consider when testing for duplicates.
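For example, a variation that treats lines as duplicates whenever everything after the 3-field timestamp matches (still keeping the last occurrence):

# Key on the line minus its first three space-separated fields
tac file | awk '{k=$0; sub(/^([^ ]+ +){3}/,"",k)} !seen[k]++' | tac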

Given your recent comments, maybe this is what you're looking for:

$ tac file | awk '!/error/{next} {k=$0; sub(/([^:]+:){3}/,"",k); gsub(/[0-9]+/,"#",k)} !seen[k]++' | tac
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
May 29 22:25:30 servername.com Fdm: another error message 3 76543

The above works by building a key, k, from the part of each line after the third colon (i.e. after the first : that isn't part of the time field), with every sequence of digits changed to #:

$ awk '!/error/{next} {k=$0; sub(/([^:]+:){3}/,"",k); gsub(/[0-9]+/,"#",k); print $0 ORS "\t -> key =", k}' file
May 29 22:25:19 servername.com Fdm: this is error message 1 error code=0x98765
-> key =  this is error message # error code=#x#
May 29 22:25:19 servername.com Fdm: error code=12345 message 2
-> key =  error code=# message #
May 29 22:25:20 servername.com Vpxa: this is error message 1 error code=0x67890
-> key =  this is error message # error code=#x#
May 29 22:25:30 servername.com Fdm: error code=34567 message 2
-> key =  error code=# message #
May 29 22:25:30 servername.com Fdm: another error message 3 76543
-> key =  another error message # #

I managed to find a way to do it. Just to give you all some more detail on the problem I had and what this script does.

Problem: I had logs that I needed to clean up, but the logs contained multiple duplicate error lines. Unfortunately, the duplicated errors had different error codes, so I couldn't just grep -v them. Also, the logs run to tens of thousands of lines, so repeatedly grep -v-ing them by hand would consume a huge amount of time, so I decided to semi-automate it with a script. The script is below. If you have ideas on how to improve it, please comment!

#!/usr/local/bin/bash
# Clean up any leftovers from a previous run
rm /tmp/tmp.log /tmp/tmpfiltered.log 2> /dev/null
printf "Please key in full location of logs: "
read log1loc log2loc
cat "$log1loc" "$log2loc" >> /tmp/tmp.log
# Stable reverse sort on the first field so the newest entries come first
sort -s -r -k1 /tmp/tmp.log -o /tmp/tmp.log
printf "Please key in the date: "
read logdate
while [[ $firstlineedit != "n" ]]
do
    grep -e "$logdate" /tmp/tmp.log | grep -i error | less
    # Keep one representative copy of the top line before filtering it out
    firstline=$(head -n 1 /tmp/tmp.log)
    head -n 1 /tmp/tmp.log >> /tmp/tmpfiltered.log
    # Pre-fill the prompt with that line so it can be trimmed to its constant part
    read -p "Enter line to remove (enter n to quit): " -e -i "$firstline" firstlineedit
    firstlinecount=$(grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -o "$firstlineedit" | wc -l)
    # Drop every line matching the trimmed pattern and loop again
    grep -e "$logdate" /tmp/tmp.log | grep -i error | grep -v "$firstlineedit" > /tmp/tmp2.log
    mv /tmp/tmp2.log /tmp/tmp.log
    if [ "$firstlineedit" != "n" ]
    then
        echo "That line and its variations have appeared $firstlinecount times in the log!"
    fi
done
less /tmp/tmpfiltered.log
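Two possible improvements, sketched under the assumption that the behavior should otherwise stay the same: create the work file with mktemp so parallel runs don't collide on fixed /tmp names, and match the pasted line with grep -F so regex metacharacters in log messages are taken literally:

workfile=$(mktemp) || exit 1    # unique temp file instead of a fixed /tmp path
sort -s -r -k1 "$log1loc" "$log2loc" > "$workfile"
# ... inside the loop, match the edited line literally rather than as a regex:
grep -e "$logdate" "$workfile" | grep -i error | grep -vF "$firstlineedit" > "$workfile.new"
mv "$workfile.new" "$workfile"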
