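I am trying to scrape a list of projects from the UK oil portal, but my code is not returning any data. Ideally, I want to produce an array of project titles.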
class Entry
  def initialize(title)
    @title = title
  end
  attr_reader :title
end
def index
  @projects = Project.all
  require 'open-uri'
  require 'nokogiri'
  doc = Nokogiri::HTML(open("https://itportal.decc.gov.uk/pathfinder/currentprojectsindex.html"))
  entries = doc.css('.operator-container')
  @entries = []
  entries.each do |row|
    title = row.css('.setoutForm').text
    @entries << Entry.new(title)
  end
end
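The link you posted does not contain the data. The page you see is a frameset, and each frame is loaded from its own URL. You want to parse the left frame, so you should edit your code to open the left frame's URL: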
# On Ruby 3.0+, open-uri no longer patches Kernel#open for URLs, so use URI.open
doc = Nokogiri::HTML(URI.open('https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index'))
The individual projects are on separate pages, so you need to open each one. For example, the first one is:
project_file = URI.open(entries.first.css('a').attribute('href').value) # URI.open is required on Ruby 3.0+
project_doc = Nokogiri::HTML(project_file)
The `.setoutForm` class picks up a lot of text. For example:
> project_doc.css('.setoutForm').text
=> "\n \n Field Type\n Location\n Water Depth (m)\n First Production\n Contact\n \n \n
Oil\n 2/15\n 155m\n Q3/2018\n
\n John Gill\n Business Development Manager\n
jgill@alphapetroleum.com\n 01483 307204\n \n \n
\n \n Project Summary\n \n \n
\n The Cheviot discovery is located in blocks 2/10a, 2/15a and 3/11b. \n
\n Reserves are approximately 46mmbbls oil.\n
\n A Field Development Plan has been submitted and technically approved. The concept is for a leased FPSA with 18+ subsea wells. Oil export will be via tanker offloading.
\n \n \n \n "
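If you do want to use that text, it helps to strip the padding first. A small sketch using only the Ruby standard library follows. The `raw` string is a shortened, hand-written sample in the same shape as the output above, and `clean_lines` is a hypothetical helper.

```ruby
# Shortened sample in the same shape as the .setoutForm text above.
raw = "\n \n  Field Type\n  Location\n \n  Oil\n  2/15\n \n"

# Hypothetical helper: split on newlines, trim each line, drop the blank ones.
def clean_lines(text)
  text.lines.map(&:strip).reject(&:empty?)
end

lines = clean_lines(raw)
p lines # => ["Field Type", "Location", "Oil", "2/15"]
```

Note that this gives you a flat list of values. Pairing the headers with their values would need the table structure of the page, which the plain text does not preserve.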
But the title is not in that text. If you want the title, scrape this part of the page:
<div class="field-header" foxid="eu1KcH_d4qniAjiN">Cheviot</div>
You can use this CSS selector:
> project_doc.css('.operator-container .field-header').text
=> "Cheviot"
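Write this code one step at a time. Unless you single-step through it, it is hard to find out where your code goes wrong. For example, I used Nokogiri's command-line tool to open an interactive Ruby shell:
nokogiri https://itportal.decc.gov.uk/eng/fox/path/PATH_REPORTS/current-projects-index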