Bash script that downloads an RSS feed and saves each entry as a separate HTML file



I'm trying to create a bash script that downloads an RSS feed and saves each entry as a separate HTML file. Here is what I have so far:

curl -L https://news.ycombinator.com//rss > hacke.txt
grep -oP '(?<=<description>).*?(?=</description>)' hacke.txt | sed 's/<description>/\n<description>/g' | grep '<description>' | sed 's/<description>//g' | sed 's/<\/description>//g' | while read description; do
title=$(echo "$description" | grep -oP '(?<=<title>).*?(?=</title>)')
if [ ! -f "$title.html" ]; then
echo "$description" > "$title.html"
fi
done

Unfortunately, it doesn't work at all :( Please tell me where my mistake is.

"Please tell me where my mistake is"

Your only mistake is trying to parse XML with regular expressions. You can't parse XML/HTML with regex! Please use an XML/HTML parser instead, such as xidel.
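To make that concrete, here is a minimal sketch of what a regex pipeline like the one in the question does to a single item. The `item` string is a hypothetical one-line sample shaped like the feed, and the two `grep` calls are a simplified version of the question's pipeline (GNU grep with `-P` support is assumed), not a copy of it.

```shell
# Hypothetical one-item sample, shaped like an item from the feed
item='<item><title>Show HN: I made an Ethernet transceiver from logic gates</title><description>&lt;a href=&quot;https://news.ycombinator.com/item?id=34035628&quot;&gt;Comments&lt;/a&gt;</description></item>'

# Step 1: keep only the text between <description> and </description>
description=$(printf '%s\n' "$item" | grep -oP '(?<=<description>).*?(?=</description>)' || true)

# Step 2: search that text for <title>; the title was already thrown away in step 1
title=$(printf '%s\n' "$description" | grep -oP '(?<=<title>).*?(?=</title>)' || true)

printf 'description: %s\n' "$description"
printf 'title: [%s]\n' "$title"   # prints "title: []"
```

The description text never contains a `<title>` element, so the title comes back empty and every entry would map to the same file name, `.html`. The description is also still entity-escaped (`&lt;a href=...`), which a regex does not undo but an XML parser does.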

The first <item> element node (not a "variable", as you call it):

$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]' \
  --output-node-format=xml --output-node-indent
<item>
<title>Show HN: I made an Ethernet transceiver from logic gates</title>
<link>https://imihajlov.tk/blog/posts/eth-to-spi/</link>
<pubDate>Sun, 18 Dec 2022 07:00:52 +0000</pubDate>
<comments>https://news.ycombinator.com/item?id=34035628</comments>
<description>&lt;a href=&quot;https://news.ycombinator.com/item?id=34035628&quot;&gt;Comments&lt;/a&gt;</description>
</item>
$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]/description'
<a href="https://news.ycombinator.com/item?id=34035628">Comments</a>

Note that while the output of the first command is XML, the output of the second command is plain text!

With the integrated EXPath File Module you can save this text (!) to an HTML file:

$ xidel -s "https://news.ycombinator.com/rss" -e '
//item/file:write-text(
replace(title,"[<>:&quot;/\|?*]",())||".html",   (: remove invalid characters :)
description
)
'
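The `replace()` call in that query deletes characters that are not safe in file names before `.html` is appended. If you ever need the same cleanup in plain shell, a rough equivalent could look like the sketch below. The function name `sanitize_filename` is hypothetical, and the character set simply mirrors the regex as written above.

```shell
# Hypothetical helper: delete the characters < > : " / | ? * from a string,
# mirroring the character class used in the replace() call
sanitize_filename() {
  printf '%s' "$1" | tr -d '<>:"/|?*'
}

title='Show HN: I made an Ethernet transceiver from logic gates'
printf '%s.html\n' "$(sanitize_filename "$title")"
# prints: Show HN I made an Ethernet transceiver from logic gates.html
```

Inside xidel the `replace()` call already does this, so the helper only matters when file names are built on the shell side.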

But you can also save it as proper HTML by parsing the <description> element node and using file:write():

$ xidel -s "https://news.ycombinator.com/rss" -e '
//item/file:write(
replace(title,"[<>:&quot;/\|?*]",())||".html",
parse-html(description),
{"indent":true()}
)
'
$ xidel -s "Show HN I made an Ethernet transceiver from logic gates.html" -e '$raw'
<html>
<head/>
<body>
<a href="https://news.ycombinator.com/item?id=34035628">Comments</a>
</body>
</html>
