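Why does XPath return text from outside the HTML tags?

I am working with a document that has some text outside the <html> tag. When I read the data inside body, it also returns text that isn't even inside the html tag.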

page_text = Nokogiri::HTML(open(file_path)).xpath("//body").text
p page_text

Output:

"WARC/1.0\nWARC-Type: response\nWARC-Date: 2012-02-11T04:48:01Z\nWARC-TREC-ID: clueweb12-0000tw-13-04988\nWARC-IP-Address: 184.85.26.15\nWARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR\nWARC-Target-URI: http://www.allchocolate.com/health/basics/\nWARC-Record-ID: \nContent-Type: application/http; msgtype=response\nContent-Length: 14577\n\n\n\n\n sample document\n\n\n hello world\n\n"

Document:

WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>sample document</title>
</head>
<body>
    hello world
</body>
</html>

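Nokogiri tries to parse the file contents as an HTML document, but the file is not a valid one. It is a text document that happens to contain an HTML document. Nokogiri has no way of knowing this and cannot work out which parts are HTML, so it tries to parse the whole thing. Because the input is not valid HTML, this produces errors.

While parsing, Nokogiri tries to recover from these errors as best it can, but in this case the recovery does not do what you want and leads to the odd output you see here.

In particular, when Nokogiri sees text before the HTML, it assumes that text belongs in the body of the HTML document. So it creates html and body elements and injects them into the document before adding the text as a child of body.

Later it sees the actual <body> tag, but since it already has a body element and there can be only one, it ignores the tag.

You need to make sure you only feed it valid HTML (or as close to valid as you can manage; the error correction can cope with minor problems). You will probably need to preprocess the file in some way to remove the extra text at the start.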

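Apparently leading text is a problem, but trailing text is not. XML is a highly structured language, and applying an XML parser to HTML means, at the very least, that you must have valid HTML. If you don't have valid HTML, you are stuck with whatever Nokogiri spits out.

It looks to me as though Nokogiri wraps the whole thing in a default root node and then returns all the text nodes inside it, essentially ignoring the //body xpath. Interestingly, if you wrap the text in a div and search for the xpath //div, there is no problem, so that may be a solution.

Nokogiri seems to think //body is equal to the root node. Ah! Maybe Nokogiri uses <body> as the root node. Nope: the xpath /body//body doesn't work.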

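Response to comment:

You could use a regex to search for the <body> tag and then insert div tags. But searching html with a simple regex is a fragile solution that won't work in all cases.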
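As a rough illustration of the div idea, here is a minimal sketch in plain Ruby. The helper name wrap_body_in_div and the id "content" are made up for this example, and the regexes assume a single, well-behaved <body>...</body> pair, which is exactly the fragility mentioned above.

```ruby
# Hypothetical helper: wrap whatever sits between <body ...> and </body>
# in a <div> so it can be selected with //div after parsing.
# The id "content" is an arbitrary name chosen for this sketch.
def wrap_body_in_div(raw, id = 'content')
  # Insert the opening <div> right after the first <body ...> tag.
  wrapped = raw.sub(/<body\b[^>]*>/i) { |tag| %(#{tag}<div id="#{id}">) }
  # Insert the closing </div> right before the first </body> tag.
  wrapped.sub(%r{</body\s*>}i) { |tag| "</div>#{tag}" }
end

sample = "WARC/1.0\n\n<html><body>\n    hello world\n</body></html>"
puts wrap_body_in_div(sample)
```

The wrapped string could then be handed to Nokogiri::HTML and queried with //div, as described above. If the input has an opening <body> but no closing one (or the reverse), this sketch produces unbalanced markup.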

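By the way, you can see how Nokogiri handles text outside of tags by parsing a document that contains only the text hello world, and then printing out all the nodes Nokogiri found: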

require 'nokogiri'
nodes = Nokogiri::HTML(open('html.html')).xpath('//*')
nodes.each do |node|
  puts node.name
end
--output:--
html
body
p

So Nokogiri wraps the text in three tags.

Or, better yet, you can parse your document and print it out as html:

require 'nokogiri'
doc = Nokogiri::HTML(open('./html.html'))
puts doc.to_html
--output:--
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html><body><p>WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577


    <title>sample document</title>

    hello world

</uuid:ff32c863-5066-4f51-802a-f31d4af074d5></p></body></html>

Which means you can get hello world like this:

require 'nokogiri'
doc = Nokogiri::HTML(open('./html.html'))
title = doc.at_xpath('//title')
puts title.next.text.strip
--output:--
hello world

Another approach is to strip the non-html content before parsing with Nokogiri:

require 'nokogiri'
infile = File.open('html.html')
non_html = infile.gets("\n\n")  # Read up to and including the first blank line
html = infile.gets(nil)  #Slurp the rest of the file
doc = Nokogiri::HTML(html)
puts doc.at_xpath('//body').text.strip
--output:--
hello world

This assumes there is always a blank line separating the non-html content from the html content.
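When the record is already in memory as a string, the same blank-line assumption can be checked explicitly. The following is only a sketch: split_at_blank_line is a hypothetical helper, and tolerating \r\n line endings is an assumption added here, not something the sample file requires.

```ruby
# Hypothetical helper: split a record into [header, html] at the first
# blank line. html is nil when no blank line is present, so the caller
# can tell that the assumption did not hold.
def split_at_blank_line(raw)
  header, separator, rest = raw.partition(/\r?\n\r?\n/)
  return [raw, nil] if separator.empty?
  [header, rest]
end

record = "WARC/1.0\nContent-Length: 14577\n\n<html><body>hello world</body></html>\n"
header, html = split_at_blank_line(record)
puts html
```

A nil second element signals that another strategy is needed for that record before calling Nokogiri::HTML.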

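First of all, @7stud's answer is spot on: you can break the file at \n\n. However, in my document collection there isn't always a \n\n before the actual html code.

So, using the same idea, I have another solution: remove all the text before the opening html tag with a regex, then pass the result to Nokogiri to parse.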

file = File.read(file_path).to_s
file = file.sub(/.*?(?=<html)/im,"")
page = Nokogiri::HTML(file)
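To see what that pattern does on a small input, here is a sketch in plain Ruby. Note that stopping at <html also removes any DOCTYPE that precedes it. The second pattern is a variant added for this illustration (not part of the answer above) that stops at the DOCTYPE when there is one. The constant names are invented.

```ruby
# Pattern from the answer above: lazily match everything up to (but not
# including) the first "<html". The /m flag lets "." cross newlines.
UP_TO_HTML = /.*?(?=<html)/im
# Variant (an assumption of this sketch): also stop at a DOCTYPE, so the
# declaration is kept when one is present.
UP_TO_DOCTYPE_OR_HTML = /.*?(?=<!DOCTYPE|<html)/im

raw = "WARC/1.0\nContent-Length: 14577\n<!DOCTYPE html>\n<html><body>hi</body></html>"

without_doctype = raw.sub(UP_TO_HTML, '')
with_doctype    = raw.sub(UP_TO_DOCTYPE_OR_HTML, '')

puts without_doctype
puts with_doctype
```

If neither marker occurs in the input, the lookahead never succeeds and sub returns the string unchanged, so the leading text would still reach the parser.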

It's simple to preprocess the content before passing it to Nokogiri:

require 'nokogiri'
text = '
WARC/1.0
WARC-Type: response
WARC-Date: 2012-02-11T04:48:01Z
WARC-TREC-ID: clueweb12-0000tw-13-04988
WARC-IP-Address: 184.85.26.15
WARC-Payload-Digest: sha1:PNCB5NNAA766RLLISZ6ODV3FJZBCATKR
WARC-Target-URI: http://www.allchocolate.com/health/basics/
WARC-Record-ID: <urn:uuid:ff32c863-5066-4f51-802a-f31d4af074d5>
Content-Type: application/http; msgtype=response
Content-Length: 14577
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
    <title>sample document</title>
</head>
<body>
    hello world
</body>
</html>
'
doc = Nokogiri::HTML(text[/<!DOCTYPE.+/m])
doc.to_html # => "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" lang=\"en\">\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n    <title>sample document</title>\n</head>\n<body>\n    hello world\n</body>\n</html>\n"

The trick is:

text[/<!DOCTYPE.+/m]

which tells Ruby to search the text and return everything from <!DOCTYPE to the end of the string, which is the valid HTML.
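One detail worth keeping in mind: String#[] with a regex returns nil when nothing matches, so a record without a DOCTYPE would pass nil along to the parser. A small guard, sketched here with a hypothetical helper name and an added /i flag (an assumption, to accept a lowercase doctype as well), makes that case explicit:

```ruby
# Hypothetical helper: return the text from "<!DOCTYPE" to the end of the
# string, or raise if the marker is missing (String#[] returns nil then).
def html_from_doctype(text)
  html = text[/<!DOCTYPE.+/mi]
  raise ArgumentError, 'no <!DOCTYPE found in input' if html.nil?
  html
end

text = "WARC/1.0\nContent-Length: 14577\n<!DOCTYPE html>\n<html><body>hello world</body></html>\n"
puts html_from_doctype(text)
```

Whether to raise, skip the record, or fall back to another marker such as <html is a design choice that depends on the collection being processed.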
