Domain matching using sed | awk


I want to extract the domains from a list of URLs. The list can contain arbitrary URL data, for example:

hqtechvietnam.com/bcm943602cs-hackintosh-meedf/
hqxbcialyc.servequake.com
hqzjz7fncd.com
hraparak.org
hrcrossing.com
hrgenius-uk.com
hrms.prodigygroupindia.com
hrome-updater.ru
hrome-update.ru
hrowedinizoin.ru
hrydc.org
hsadjy30bjtnd.servecounterstrike.com
hsa.ht
HSBC Invest Direct Ltd
hs-fileserver.info
hslvizag.in
hssubnsx.xyz
htaminorfault.xyz
htempurl.com
http://185.102.122[]2/rrtn/Spencer crypt.exe
http://23.95.200195/image/images.exe

I am currently using the shell script below to sort the data:

# PATTERN
URL_MATCH="(http|https|hxxp|hxxps)://[a-zA-Z0-9./?=_%:-]*"
DOMAIN_MATCH="^[a-zA-Z0-9]+([-.]?[a-zA-Z0-9]+)*\.[a-zA-Z]+$"
IP_MATCH="[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}"

CHK1 () {
    echo "Initiating Check process #1" |& GET_LOG
    while read -r DOMAINLIST; do
        if grep -qE "${IP_MATCH}" <<< "${DOMAINLIST}"; then
            # The line contains something that looks like an IPv4 address.
            echo "${DOMAINLIST}" | grep -oE "${IP_MATCH}" >> "${IPOUT}"
        elif grep -qE "${URL_MATCH}" <<< "${DOMAINLIST}"; then
            # Full URL: take the host ($3) and keep its last two labels,
            # or the last three when the second-to-last label is "com".
            echo "${DOMAINLIST}" | awk -F / '{l=split($3,a,"."); print (a[l-1]=="com"?a[l-2] OFS:X) a[l-1] OFS a[l]}' OFS="." >> "${URLOUT}"
        elif grep -qE "${DOMAIN_MATCH}" <<< "${DOMAINLIST}"; then
            # Bare domain: keep only the last two dot-separated labels.
            echo "${DOMAINLIST}" | sed 's/.*\.\(\w*\.\w*\)/\1/' >> "${DOMAINOUT}"
        else
            # Anything else is logged as an error.
            echo "${DOMAINLIST}" >> "${ERROROUT}"
        fi
    done < "${INFILE}"
}

The code above works reasonably well so far. Here is the result:

URLOUT FILE:
hqzjz7fncd.com
hraparak.org
hrcrossing.com
hrgenius-uk.com
hrome-updater.ru
hrome-update.ru
hrowedinizoin.ru
hrydc.org
hsa.ht
hs-fileserver.info
hslvizag.in
hssubnsx.xyz
htaminorfault.xyz
htempurl.com
prodigygroupindia.com
servecounterstrike.com
servequake.com
ERROUT FILE:
hqtechvietnam.com/bcm943602cs-hackintosh-meedf/
HSBC Invest Direct Ltd
102.122[]2

But if the URL list contains data such as

google.co.uk 
example.co.in
https://example.co.au/file1
http://example.co.au/file1

it only gives me

co.uk
co.in
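
The truncation comes from the `sed` branch: its pattern keeps only the last two dot-separated labels, so any name whose suffix itself spans two labels loses its registrable part. A minimal reproduction, assuming GNU sed (for `\w`) and assuming the posted expression originally carried the backslashes shown here; `last_two_labels` is a hypothetical helper name used only for illustration:

```shell
#!/usr/bin/env bash

# Keep only the last two dot-separated labels of each input line.
# \w is a GNU sed extension; lines with fewer than two dots pass through unchanged.
last_two_labels() {
    sed 's/.*\.\(\w*\.\w*\)/\1/'
}

printf '%s\n' google.co.uk example.co.in mail.google.com | last_two_labels
```

This prints `co.uk`, `co.in` and `google.com`: correct for `mail.google.com`, but too short for the `co.uk` and `co.in` names.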

I want

google.co.uk 
example.co.in
example.co.au

If the URLs are

mail.google.com
example.com.uk 

the expected output should be

google.com
example.com.uk

You can do this in bash itself, without external tools, using shell parameter expansion:

shopt -s extglob
while read -r line; do
    # remove any leading http:// https:// hxxp:// hxxps://
    line=${line#h@(tt|xx)p?(s)://}
    # remove any trailing path
    line=${line%%/*}
    # print the line if it has at least one dot.
    [[ $line == *.* ]] && echo "$line"
done < file

Output for all of the sample inputs combined:

hqtechvietnam.com
hqxbcialyc.servequake.com
hqzjz7fncd.com
hraparak.org
hrcrossing.com
hrgenius-uk.com
hrms.prodigygroupindia.com
hrome-updater.ru
hrome-update.ru
hrowedinizoin.ru
hrydc.org
hsadjy30bjtnd.servecounterstrike.com
hsa.ht
hs-fileserver.info
hslvizag.in
hssubnsx.xyz
htaminorfault.xyz
htempurl.com
185.102.122[]2
23.95.200195
google.co.uk
example.co.in
example.co.au
example.co.au
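
Note that this output keeps subdomains, so `mail.google.com` would still be printed in full. Reducing each host to its registrable domain requires knowing which suffixes span two labels. The following is a minimal sketch under that assumption: the `TWO_LABEL_SUFFIXES` array and the `registrable_domain` function are hypothetical names, and the list is deliberately incomplete. A more robust approach would probably consult the Public Suffix List instead of a hand-maintained array.

```shell
#!/usr/bin/env bash
shopt -s extglob   # must be enabled before the function below is parsed

# Hypothetical, incomplete list of suffixes that span two labels.
TWO_LABEL_SUFFIXES=(co.uk co.in co.au com.uk)

registrable_domain() {
    local host=$1 suffix keep=2 n
    local -a labels

    # Strip an optional http/https/hxxp/hxxps scheme and any trailing path.
    host=${host#h@(tt|xx)p?(s)://}
    host=${host%%/*}

    # Reject anything without a dot (for example free text).
    [[ $host == *.* ]] || return 1

    # Keep three labels when the host ends in a known two-label suffix.
    for suffix in "${TWO_LABEL_SUFFIXES[@]}"; do
        [[ $host == *."$suffix" ]] && keep=3
    done

    # Split the host on dots and print the last $keep labels, re-joined with dots.
    IFS=. read -r -a labels <<< "$host"
    n=${#labels[@]}
    if (( n < keep )); then keep=$n; fi
    local IFS=.
    printf '%s\n' "${labels[*]:n-keep}"
}

for h in mail.google.com google.co.uk https://example.co.au/file1 example.com.uk; do
    registrable_domain "$h"
done
```

With the list above, the loop prints `google.com`, `google.co.uk`, `example.co.au` and `example.com.uk`. IP addresses and malformed entries such as `23.95.200195` are not treated specially here, so they would need to be filtered out beforehand.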
