I'm using:
xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]' https://news.ycombinator.com/newest
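But the output is not in the expected order: the URL comes after the text, which makes it hard to parse.
Am I missing something to get the right order?
I get: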
There Is No Such Thing as a Microservice (youtube.com)
https://www.youtube.com/watch?v=FXCLLsCGY0s
I expect:
https://www.youtube.com/watch?v=FXCLLsCGY0s
There Is No Such Thing as a Microservice (youtube.com)
or, even better:
https://www.youtube.com/watch?v=FXCLLsCGY0s There Is No Such Thing as a Microservice (youtube.com)
See "Using / on sequences rather than sets" for why this happens, and why in this case you should use the XPath 3 simple map operator (!):
$ xidel -s "https://news.ycombinator.com/newest" -e '
//span[@class="titleline"]/a ! (@href,.)
'
(Also, specify the input first.)
For a simple string concatenation, the ! operator isn't necessary:
-e '//span[@class="titleline"]/a/join((@href,.))'
-e '//span[@class="titleline"]/a/concat(@href," ",.)'
-e '//span[@class="titleline"]/a/x"{@href} {.}"'
(Bonus) Output to JSON:
$ xidel -s "https://news.ycombinator.com/newest" -e '
  array{
    //span[@class="titleline"]/a/{
      "title":.,
      "url":@href
    }
  }
'
Nodes are returned in document order, not in the order written in the XPath expression, so extra parsing is needed. With xmllint and awk:
xmllint --html --recover --xpath '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' tmp.html 2>/dev/null |
gawk 'BEGIN{RS="\n? href="; FS="\n"}{ print $1, $2}' | tr -d '"'
Result:
https://github.com/thesephist/ink Ink: Minimal, functional programming language inspired by modern JavaScript, Go
https://controlleddigitallending.org/whitepaper/ A White Paper on Controlled Digital Lending of Library Books
item?id=35471687 Ask HN: Connect Guitar to Tesla?
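Note: the XPath in this answer doesn't require awk, because in document order the href comes before a/text(). The awk step is included as a reference for how to change the order of the output.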
Found a better way:
$ xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' https://news.ycombinator.com/newest
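Because this expression prints each URL on one line and its title on the next, the pairs can be merged into the single-line form asked for in the question with standard tools. A minimal sketch, assuming every link yields exactly one href line followed by exactly one text line; join_pairs is a hypothetical helper name, and the printf data stands in for real xidel output:

```shell
#!/bin/sh
# join_pairs (hypothetical helper): merge every two consecutive input lines
# into one line, separated by a single space.
join_pairs() {
    paste -d ' ' - -
}

# Sample data standing in for the xidel output (URL line, then title line).
printf '%s\n' \
    'https://www.youtube.com/watch?v=FXCLLsCGY0s' \
    'There Is No Such Thing as a Microservice' |
    join_pairs
```

In practice the xidel command would be piped into paste -d ' ' - - directly. If a link has no text node, or more than one, the pairing drifts, so the ! based expressions from the other answer are likely the more robust choice.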