xidel: wrong order of results for Hacker News



I use:

xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]' https://news.ycombinator.com/newest 

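But the output is not in the expected order: the URL comes after the text, which makes it hard to parse.

Am I missing something needed to get the right order?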

I get:

There Is No Such Thing as a Microservice (youtube.com)
https://www.youtube.com/watch?v=FXCLLsCGY0s

I expect:

https://www.youtube.com/watch?v=FXCLLsCGY0s
There Is No Such Thing as a Microservice (youtube.com)

Or, even better:

https://www.youtube.com/watch?v=FXCLLsCGY0s There Is No Such Thing as a Microservice (youtube.com)

Please see "Using / on sequences rather than on sets" for why this happens, and why you should use the XPath 3 mapping operator ! in this case:

$ xidel -s "https://news.ycombinator.com/newest" -e '
//span[@class="titleline"]/a ! (@href,.)
'

(Also, please specify the input first.)

For a simple string concatenation, the mapping operator is not necessary:

-e '//span[@class="titleline"]/a/join((@href,.))'
-e '//span[@class="titleline"]/a/concat(@href," ",.)'
-e '//span[@class="titleline"]/a/x"{@href} {.}"'

(Bonus) Output to JSON:

$ xidel -s "https://news.ycombinator.com/newest" -e '
array{
//span[@class="titleline"]/a/{
"title":.,
"url":@href
}
}
'

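Nodes are returned in document order, not in the order they are written in the XPath expression, so extra parsing is needed. With xmllint and awk: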

xmllint --html --recover --xpath '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' tmp.html 2>/dev/null|
gawk 'BEGIN{RS="\n? href="; FS="\n"}{ print $1, $2}' | tr -d '"'

Result:

https://github.com/thesephist/ink Ink: Minimal, functional programming language inspired by modern JavaScript, Go
https://controlleddigitallending.org/whitepaper/ A White Paper on Controlled Digital Lending of Library Books
item?id=35471687 Ask HN: Connect Guitar to Tesla?

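Note: the XPath in this answer does not need awk, because in document order the href comes before a/text(). The awk step is included as a reference for how to change the output order.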
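
As a rough sketch of changing the output order after the fact: if the output strictly alternates between a title line and a URL line, as in the question's sample, plain awk can swap each pair. The helper name pair_lines is hypothetical, and the strict alternation is an assumption; a title containing a line break, or a blank line between records, would break it.

```shell
# Hypothetical helper: turn alternating (title, URL) lines into "URL title".
# Assumes the input strictly alternates: a title line, then its URL line.
pair_lines() {
  awk '
    NR % 2 == 1 { title = $0; next }  # odd lines: remember the title
    { print $0, title }               # even lines: print the URL, then the title
  '
}

# Sample input copied from the output shown in the question.
printf '%s\n' \
  'There Is No Such Thing as a Microservice (youtube.com)' \
  'https://www.youtube.com/watch?v=FXCLLsCGY0s' | pair_lines
```

For the sample above this prints the URL followed by the title on a single line. Pairing inside the XPath expression is more robust, since it does not depend on how the output happens to be laid out.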

Found a better way:

$ xidel -e '//span[@class="titleline"]/a/@href|//span[@class="titleline"]/a/text()' https://news.ycombinator.com/newest