I'm trying to create a bash script that downloads an RSS feed and saves each entry as a separate HTML file. Here is what I have so far:
curl -L https://news.ycombinator.com//rss > hacke.txt
grep -oP '(?<=<description>).*?(?=</description>)' hacke.txt | sed 's/<description>/n<description>/g' | grep '<description>' | sed 's/<description>//g' | sed 's/</description>//g' | while read description; do
title=$(echo "$description" | grep -oP '(?<=<title>).*?(?=</title>)')
if [ ! -f "$title.html" ]; then
echo "$description" > "$title.html"
fi
done
Unfortunately, it doesn't work at all :( Please tell me where my mistake is.
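"Please tell me where my mistake is."
Your only mistake is trying to parse XML with regular expressions. You can't parse XML/HTML with regex! Please use an XML/HTML parser, such as xidel.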
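To see concretely why the pipeline produces nothing useful, here is a minimal sketch that runs its two extraction steps on a made-up one-item feed. The `feed` string is an assumption for illustration, not real Hacker News output, and GNU grep with `-P` is assumed, as in your script. Once the first grep has cut out the text between the description tags, there is no title element left in it to find:

```shell
#!/usr/bin/env bash
# Hypothetical one-item feed on a single line (an assumption for this sketch).
feed='<item><title>Example post</title><description>Comments here</description></item>'

# Step 1 of the original pipeline: keep only the text between the <description> tags.
description=$(printf '%s\n' "$feed" | grep -oP '(?<=<description>).*?(?=</description>)')

# Step 2: search that text for a <title>. It contains none, so grep matches nothing.
# grep exits non-zero on no match; "|| true" keeps the sketch running under "set -e".
title=$(printf '%s\n' "$description" | grep -oP '(?<=<title>).*?(?=</title>)' || true)

printf 'description=[%s]\n' "$description"
printf 'title=[%s]\n' "$title"
```

The title comes back empty, so the script would try to write every entry to a file named just `.html`.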
The first <item> element node (not a "variable", as you call it):
$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]' \
  --output-node-format=xml --output-node-indent
<item>
<title>Show HN: I made an Ethernet transceiver from logic gates</title>
<link>https://imihajlov.tk/blog/posts/eth-to-spi/</link>
<pubDate>Sun, 18 Dec 2022 07:00:52 +0000</pubDate>
<comments>https://news.ycombinator.com/item?id=34035628</comments>
<description>&lt;a href="https://news.ycombinator.com/item?id=34035628"&gt;Comments&lt;/a&gt;</description>
</item>
$ xidel -s "https://news.ycombinator.com/rss" -e '//item[1]/description'
<a href="https://news.ycombinator.com/item?id=34035628">Comments</a>
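Note that while the output of the first command is XML, the output of the second command is plain text!
With the integrated EXPath File Module you can save this text (!) to an HTML file: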
$ xidel -s "https://news.ycombinator.com/rss" -e '
//item/file:write-text(
replace(title,"[<>:""/\|?*]",())||".html", (: remove invalid characters :)
description
)
'
But you can also save it as proper HTML by parsing the <description> element node and using file:write():
$ xidel -s "https://news.ycombinator.com/rss" -e '
//item/file:write(
replace(title,"[<>:""/\|?*]",())||".html",
parse-html(description),
{"indent":true()}
)
'
$ xidel -s "Show HN I made an Ethernet transceiver from logic gates.html" -e '$raw'
<html>
<head/>
<body>
<a href="https://news.ycombinator.com/item?id=34035628">Comments</a>
</body>
</html>
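If you ever need to build the same file names on the bash side, the character stripping done by replace() above can be approximated with tr. This is only a sketch: `sanitize_filename` is a hypothetical helper name, not part of xidel, and the character set simply mirrors the one in the expression above.

```shell
#!/usr/bin/env bash
# sanitize_filename: hypothetical helper that deletes characters which are
# not allowed in file names on common file systems: < > : " / | ? *
sanitize_filename() {
  printf '%s\n' "$1" | tr -d '<>:"/|?*'
}

# Usage: the colon is dropped, everything else is kept.
sanitize_filename 'Show HN: I made an Ethernet transceiver from logic gates'
# prints: Show HN I made an Ethernet transceiver from logic gates
```

Appending ".html" to that result gives the file name used in the last command above.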