Problem parsing HTML with Nokogiri

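I have some HTML and want to get the content under the <body> element. However, whatever I try, after parsing the HTML with Nokogiri, everything in the doctype and <head> also becomes part of the <body> element. When I retrieve the <body> element, I also see the doctype and the contents of the <meta> and <script> tags.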

My original HTML is:

 <!DOCTYPE html "about:legacy-compat">
<html>
   <head>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <title>Some Title</title>
      <meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
      <link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
      <script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
   </head>
   <body marginwidth="6" marginheight="6" leftmargin="6" topmargin="6">
      <div class="hello-status">Hello World</div>
      <div valign="top"></div>
   </body>
</html>

The code I am using is:

parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html

What I got:

<p>about:legacy-compat"&gt;</p>
\n
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
\n
<title>Some title</title>
\n
<meta name='viewport' id='helloviewport' content='initial-scale=1.0,maximum-scale=2.5' />
\n
<link rel='stylesheet' id='hello-stylesheet' type='text/css' href='some-4ac294cd125e1a062562aca1c83714ff.css'/>
\n<script id='hello-javascript' type='text/javascript' src='/hello/hello.js'></script>
<div class="hello-status">Hello World</div>
\n
<div valign="top">\n\n</div>

What I expected:

<div class="hello-status">Hello World</div>
\n
<div valign="top">\n\n</div>

Any idea what is going on here?

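I got your example working by cleaning up the original HTML first. I removed the "about:legacy-compat" string from the doctype, which seems to be what was confusing Nokogiri: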

# clean up the junk in the doctype
my_html.sub!('"about:legacy-compat"', '')
# parse and get the body
parsed_html = Nokogiri::HTML(my_html)
body_tag_content = parsed_html.at('body')
puts body_tag_content.inner_html
# => "\n      <div class=\"hello-status\">Hello World</div>\n      <div valign=\"top\"></div>\n   "

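In general, when you parse potentially dirty third-party data such as HTML, you should clean it first so the parser does not choke and do unexpected things. You can run the HTML through a linter or a "tidy" tool to try to clean it up automatically. When all else fails, you have to clean it by hand, as above.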
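
If the doctype varies between documents, the hand-cleaning step could be generalized a little. The following is only a sketch, assuming it is acceptable to replace whatever doctype the document declares with the plain HTML5 one. `normalize_doctype` is a hypothetical helper name, not part of Nokogiri, and the regex assumes the doctype contains no `>` character inside it.

```ruby
# Hypothetical helper (not a Nokogiri API): swap any doctype declaration
# for the plain HTML5 doctype. Returns a new string and leaves the input
# untouched. Assumes the doctype itself contains no '>' character.
def normalize_doctype(html)
  html.sub(/<!DOCTYPE[^>]*>/i, '<!DOCTYPE html>')
end

dirty = <<~HTML
  <!DOCTYPE html "about:legacy-compat">
  <html>
     <head><title>Some Title</title></head>
     <body><div class="hello-status">Hello World</div></body>
  </html>
HTML

clean = normalize_doctype(dirty)
puts clean.lines.first
```

The cleaned string can then be passed to `Nokogiri::HTML(clean)` exactly as in the snippet above. Unlike `sub!`, this version does not modify the original string, which may be useful if the raw HTML is needed later.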

HTML Tidy/Cleaning in Ruby 1.9
