Extract multiple words after a specific word in HTML using a Linux/Unix script

I have a file 'movie.html':

<html>
<head><title>Index of /Data/Movies/Hollywood/2016_2017/</title></head>
<body bgcolor="white">
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>                                     25-Nov-2019 10:25       -
<a href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)/</a>                              25-Nov-2019 10:26       -
<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>                                    25-Nov-2019 10:27       -
</pre><hr></body>
</html>

I want to extract the title and the link from each entry, separated by a pipe, like this:

title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

I tried this code:

awk -F'[><]' 'BEGIN{ print "title","link" } /%29/ {print $3,$2}' movie.html > output.txt

But the output is not what I expected. Please help, I am still a beginner.

Parsing HTML with regex is not recommended (see https://stackoverflow.com/a/1732454/12957340 for the reasons), but here is a potential solution:

awk -F'[<>/"]' 'BEGIN{ print "title | link" }; /\(.*\)/ {print $6 " | " $3 "/"}' movie.html
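
To see why fields 6 and 3 are the ones to print, it can help to dump every field this separator set produces for a single line. A small sketch, using one listing line copied from movie.html (the variable names `line` and `fields` are only for illustration):

```shell
# One listing line from movie.html (spacing before the date shortened).
line='<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>   25-Nov-2019 10:25   -'

# Split on <, >, / and " and print each non-empty field with its number.
fields=$(printf '%s\n' "$line" |
  awk -F'[<>/"]' '{ for (i = 1; i <= NF; i++) if ($i != "") print i ": " $i }')

printf '%s\n' "$fields"
```

Note that `/` is itself a separator here, so field 3 comes out without the trailing slash of the href.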

With the samples you have shown, could you please try the following. I prefer the match function here.

awk '
BEGIN{
OFS=" | "
print "title | link"
}
match($0,/^<a href="[^"]*/){
val=substr($0,RSTART+9,RLENGTH-9)
match($0,/>.*<\/a>/)
print substr($0,RSTART+1,RLENGTH-6),val
}' Input_file

Explanation: a detailed walkthrough of the code above.

awk '                                      ##Starting awk program from here.
BEGIN{                                     ##Starting BEGIN section of this program from here.
OFS=" | "                                ##Setting OFS as space | space here.
print "title | link"                     ##Printing title space | space link here.
}
match($0,/^<a href="[^"]*/){               ##Match from the start of the line: <a href=" up to (not including) the closing ".
val=substr($0,RSTART+9,RLENGTH-9)        ##val holds the matched text minus the 9-character prefix <a href=", i.e. the link.
match($0,/>.*<\/a>/)                     ##Match from > through </a> here.
print substr($0,RSTART+1,RLENGTH-6),val  ##Print the title (the match minus the leading > and the trailing /</a>) and val, joined by OFS.
}
' Input_file                               ##Mentioning Input_file name here.
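
The offsets `+9`, `-9`, `+1` and `-6` are easier to follow once you see what match() stores in RSTART and RLENGTH. A minimal sketch on one sample line (`line` and `offsets` are illustrative names, not part of the answer above):

```shell
# One listing line from movie.html (spacing before the date shortened).
line='<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>   25-Nov-2019 10:27   -'

# Run the same two match() calls and report where each match starts
# and how long it is; substr() then trims the fixed-width delimiters.
offsets=$(printf '%s\n' "$line" | awk '
  match($0, /^<a href="[^"]*/) { print "href:  RSTART=" RSTART, "RLENGTH=" RLENGTH }
  match($0, />.*<\/a>/)        { print "title: RSTART=" RSTART, "RLENGTH=" RLENGTH }')

printf '%s\n' "$offsets"
```

The first match always begins with the 9 characters `<a href="`, and the second is wrapped in `>` on the left and `/</a>` on the right, which is 6 characters in total.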

If ed is available/acceptable, and you understand the risks of parsing an HTML file with a non-HTML parser:

script.ed

0a
title | link
.
p
g/^<a href=.\{1,\}/s/^.\{1,\}="//\
s/\/[[:blank:]]*<\/a>.*$//\
s/">/ /\
s/^\([^ ]\{1,\}\) \(.\{1,\}\)/\2 | \1/p
Q

Then run:

ed -s file.html < script.ed

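Another approach: I think you can use grep to extract the relevant part of each line, then use awk to format the output. Note that the -P option requires GNU grep.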

grep -oP 'href="([^".]*)">([^</.]*)' movie.html | awk -F'[">]' 'BEGIN{print "title | link"}{print $4" | "$2}'

grep will produce lines like these:

href="1%20Buck%20%282017%29/">1 Buck (2017)
href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)
href="1%20Night%20%282016%29/">1 Night (2016)

Add the sub() and gsub() functions to your code:

awk -F'[><]' 'BEGIN{ print "title","|", "link" } /%29/ {sub(/\//, " |", $3);gsub(/^a href="|"$/, "", $2);print $3,$2}' file
title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

Redirecting the output to a file:

awk -F'[><]' 'BEGIN{ print "title","|", "link" } /%29/ {sub(/\//, " |", $3);gsub(/^a href="|"$/, "", $2);print $3,$2}' file > output.txt
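
For context on why the two substitutions are needed, here is a rough sketch that prints the raw `$3` and `$2` as the original command sees them for one sample line (`line` and `raw` are illustrative names):

```shell
# One listing line from movie.html (spacing before the date shortened).
line='<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>   25-Nov-2019 10:25   -'

# With -F'[><]' the title keeps its trailing slash and the link
# still carries the leading a href=" and the closing quote.
raw=$(printf '%s\n' "$line" |
  awk -F'[><]' '{ print "$3=[" $3 "]"; print "$2=[" $2 "]" }')

printf '%s\n' "$raw"
```

sub() replaces the trailing slash of the title with " |", and gsub() strips the leftover attribute text around the link.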
