How can I parse this Craigslist page in this specific way?

This is the page in question: http://phoenix.craigslist.org/cpg/

What I want to do is create an array that looks like this:

date (captured from the h4 tag on that page) => in cell [0][0][0],
link text => in cell [0][1][0],
link href => in cell [0][1][1]

That is, each row of the array stores one of each of these items.
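
To make the target layout concrete, here is a minimal sketch in plain Ruby. It only illustrates the indexing and does no scraping; the helper name `build_row` is hypothetical, and the sample values are just illustrative postings.

```ruby
# Hypothetical helper: pack one posting into the nested layout described above.
#   row[0][0] => date (from the h4 tag)
#   row[1][0] => link text
#   row[1][1] => link href
def build_row(date, text, href)
  [[date], [text, href]]
end

cells = []
cells << build_row("Mon May 28", "Leads need follow up",
                   "http://phoenix.craigslist.org/wvl/cpg/3043296202.html")
cells << build_row("Mon May 28", ".Net/Java Developers",
                   "http://phoenix.craigslist.org/cph/cpg/3043067349.html")

puts cells[0][0][0]  # date of the first posting
puts cells[0][1][0]  # link text of the first posting
puts cells[0][1][1]  # href of the first posting
```

A flat row such as `[date, text, href]` would carry the same information with one less level of nesting, if the extra level is not actually needed.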

All I have done so far is pull in all the h4 tags and store them in a hash like this:

contents2[link[:date]] = content_page.css("h4").text

The problem with this is that a single cell ends up holding the text of every h4 tag on the page, whereas I want one date per cell.

For example:

0 => Mon May 28 - Leads need follow up - (Phoenix) - http://phoenix.craigslist.org/wvl/cpg/3043296202.html
1 => Mon May 28 - .Net/Java Developers - (phoenix) - http://phoenix.craigslist.org/cph/cpg/3043067349.html

Any ideas on how to approach this in code would be greatly appreciated.

How about this?

require 'rubygems'
require 'open-uri'
require 'nokogiri'
# URI.open comes from open-uri; Kernel#open no longer accepts URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open("http://phoenix.craigslist.org/cpg/"))
# Postings start inside the second blockquote on the page
bq = doc.xpath('//blockquote')[1]
date  = nil         # Temp store of date of postings
posts = Array.new   # Store array of all postings here
# Loop through all blockquote children collecting data as we go along...
bq.children.each { |nod|
  # The date is stored in the h4 nodes. Grab it from there.
  date = nod.text if nod.name == "h4"
  # Skip nodes until we have a date
  next if !date
  # Skip nodes that are not p blocks. The p blocks contain the postings.
  next if nod.name != "p"
  # We have a p block. Extract posting data.
  anchor = nod.at_css('a')
  next unless anchor   # Skip p blocks that contain no link
  link = anchor['href']
  text = nod.text
  # Add new posting to array
  posts << [date, text, link]
}
# Output everything we just collected
posts.each { |p| puts p.join(" - ") }
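
If grouping by date matters more than a flat list, the collected rows can be reshaped afterwards. This is a minimal sketch, assuming each row has the `[date, text, link]` shape built by the script above; the sample rows are copied from the question's example and the name `by_date` is hypothetical.

```ruby
# Sample rows in the same [date, text, link] shape the script collects.
posts = [
  ["Mon May 28", "Leads need follow up - (Phoenix)",
   "http://phoenix.craigslist.org/wvl/cpg/3043296202.html"],
  ["Mon May 28", ".Net/Java Developers - (phoenix)",
   "http://phoenix.craigslist.org/cph/cpg/3043067349.html"]
]

# Group rows under their date, keeping only [text, link] for each posting.
by_date = posts.group_by(&:first)
               .transform_values { |rows| rows.map { |_, text, link| [text, link] } }

by_date.each do |date, entries|
  puts date
  entries.each { |text, link| puts "  #{text} - #{link}" }
end
```

`Hash#transform_values` needs Ruby 2.4 or newer; on older versions the same result can be built with `each_with_object`.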

There are other ways, but traversing the document is probably the simplest:

doc.traverse do |node|
  @date = node.text if node.name == 'h4'
  next unless @date
  break if node.text['next 100 postings']
  puts [@date, node.parent.text, node[:href]].join(' - ') if node.name == 'a'
end
