Scraping iframe data with Nokogiri and Ruby

This is the script I wrote to scrape the data inside an <iframe> tag using Nokogiri:

require 'nokogiri'
require 'restclient'
doc = Nokogiri::HTML(RestClient.get("http://www.sample_site.com/")) 
doc.xpath('//iframe[@width="1001" and @height="973"]').children

This is what I get:

=> [#<Nokogiri::XML::Text:0x1913970 "\r\nYour browser does not support inline frames\r\n">]

Can anyone tell me why?

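An iframe is used to embed another document within the current HTML document. This means the iframe loads its content from the external source specified in its src attribute.

So, if you want to scrape the iframe's content, you should send a request to the external source it loads that content from.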

<!-- The iframe (notice the 'src' attribute) -->
<iframe src="iframe_source_url" height="973" width="1001">
  <!-- iframe content -->
</iframe>

# Code to do the scraping
require 'nokogiri'
require 'restclient'
doc = RestClient.get('iframe_source_url')
parsed_doc = Nokogiri::HTML(doc)
parsed_doc.css('#yourSelectorHere') # or parsed_doc.xpath('...')

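Note (about the error)

When you scrape, the HTTP client you use acts as your browser (here, your browser is RestClient). The message says that your browser does not support inline frames. In other words, RestClient does not support inline frames, which is why it does not load the frame's content.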

This is an issue with RestClient, not with Nokogiri.

RestClient does not retrieve the content of the iframe. Try inspecting what RestClient.get("http://www.sample_site.com/") returns; you will find a string like this in it:

<iframe src="page-1.htm" name="test" height="120" width="600">
  You need a Frames Capable browser to view this content.
</iframe>

Nokogiri handles this just fine: it returns the contents of the iframe node, which is evidently a single Text node holding the string you got as a result.
