Data scraping: looping link clicks across multiple pages



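We are trying to find a way to use Mechanize to scrape all the data we want from the UCAS website and add it to arrays. At the moment we are struggling to code the link clicks with Mechanize. Can anyone help with clicking three links in succession inside a loop, so that we work through every page of search results? The first link, which shows all of a university's courses, is in the "more courses link" div class.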

The second link, which shows the course name, duration and qualification, is in the div with class coursenamearea.

The third link is in the "course details display" div, on the a element with ID coursedetailtab_entryreqs.

At the moment we are scraping the university names with the following:

require 'mechanize'

class PagesController < ApplicationController
  def home
    mechanize = Mechanize.new
    @uninames_array = []
    page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')

    # Collect the names on the first results page.
    page.search('li.result h3').each do |h3|
      name = h3.text
      @uninames_array.push(name)
    end

    # Follow the ">" pager link and keep collecting until no next page remains.
    while next_page_link = page.at('.pager a[text()=">"]')
      page = mechanize.get(next_page_link['href'])
      page.search('li.result h3').each do |h3|
        name = h3.text
        @uninames_array.push(name)
      end
    end

    puts @uninames_array.to_s
  end
end

The course name, duration and qualification come from the following:

require 'mechanize'

mechanize = Mechanize.new
@duration_array = []
@qual_array = []
@courses_array = []
page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')

# Walk the results pages once, collecting all three fields from each page.
# (Three separate pager loops would leave the second and third loops with
# nothing to do, because the first one already ends on the last page.)
loop do
  page.search('div.courseinfoduration').each { |x| @duration_array.push(x.text.strip) }
  page.search('div.courseinfooutcome').each { |y| @qual_array.push(y.text.strip) }
  page.search('div.coursenamearea h4').each { |h4| @courses_array.push(h4.text.strip) }

  # Follow the ">" pager link until there is no next page.
  next_page_link = page.at('.pager a[text()=">"]')
  break unless next_page_link
  page = mechanize.get(next_page_link['href'])
end

puts @courses_array, @duration_array, @qual_array

If you want to do this with a single Mechanize instance, why not chain the steps together and store the pages you need to jump to in variables?
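
As a rough sketch of "storing the pages to jump to in a variable", the snippet below first gathers link targets into an array and then visits them with the same agent. The helper names collect_hrefs and visit_all are made up for illustration, and the selector in the commented usage is only a guess at the class name described in the question. The helpers rely only on the page responding to search and the agent responding to get, as Mechanize pages and agents do.

```ruby
# Collect the href of every link on +page+ that matches +selector+.
# Missing hrefs and duplicates are dropped.
def collect_hrefs(page, selector)
  page.search(selector).map { |link| link['href'] }.compact.uniq
end

# Visit each stored href with the same agent and yield the fetched page.
def visit_all(agent, hrefs)
  hrefs.each { |href| yield agent.get(href) }
end

# Possible usage with Mechanize (not run here; the selector is an assumption):
#
#   require 'mechanize'
#   agent = Mechanize.new
#   page  = agent.get(start_url)
#   course_list_hrefs = collect_hrefs(page, 'div.morecourseslink a')
#   visit_all(agent, course_list_hrefs) { |courses_page| puts courses_page.title }
```

Keeping the hrefs in a plain array means the list of pages to visit survives even after the agent has moved on to another page.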

If all of your code works, you can simply chain it together in one method call:

def home
  require 'mechanize'
  mechanize = Mechanize.new

  # Step 1: collect the university names from every provider results page.
  @uninames_array = []
  page = mechanize.get('http://search.ucas.com/search/providers?CountryCode=3&RegionCode=&Lat=&Lng=&Feather=&Vac=2&Query=&ProviderQuery=&AcpId=&Location=scotland&IsFeatherProcessed=True&SubjectCode=&AvailableIn=2016')
  loop do
    page.search('li.result h3').each { |h3| @uninames_array.push(h3.text) }

    # Follow the ">" pager link until there is no next page.
    next_page_link = page.at('.pager a[text()=">"]')
    break unless next_page_link
    page = mechanize.get(next_page_link['href'])
  end

  # Step 2: reuse the same Mechanize instance for the course results pages.
  @duration_array = []
  @qual_array = []
  @courses_array = []
  page = mechanize.get('http://search.ucas.com/search/results?Vac=2&AvailableIn=2016&IsFeatherProcessed=True&page=1&providerids=41')
  loop do
    # Gather all three fields from the current page before moving on,
    # so a single pass over the pager fills every array.
    page.search('div.courseinfoduration').each { |x| @duration_array.push(x.text.strip) }
    page.search('div.courseinfooutcome').each { |y| @qual_array.push(y.text.strip) }
    page.search('div.coursenamearea h4').each { |h4| @courses_array.push(h4.text.strip) }

    next_page_link = page.at('.pager a[text()=">"]')
    break unless next_page_link
    page = mechanize.get(next_page_link['href'])
  end
end
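
For the three successive clicks the question asks about, one possible shape is a nested loop: page through the providers, follow each provider's course-list link, page through its courses, then open each course and its entry-requirements tab. This is only a sketch under assumptions. The method names each_page and crawl and the keys :courses_link, :course_link and :entry_link are hypothetical, and the selectors are passed in rather than hard-coded because the exact class names need checking against the live page. The originating page is passed as the third argument to get so that, as far as I understand Mechanize, relative hrefs are resolved against the page the link came from.

```ruby
NEXT_PAGE_SELECTOR = '.pager a[text()=">"]'

# Yield +page+, then every following page reached through the ">" pager link.
def each_page(agent, page)
  loop do
    yield page
    next_link = page.at(NEXT_PAGE_SELECTOR)
    break unless next_link
    page = agent.get(next_link['href'], [], page)
  end
end

# Follow three levels of links and yield the course name together with
# its entry-requirements page. Courses without such a link are skipped.
def crawl(agent, start_url, selectors)
  each_page(agent, agent.get(start_url)) do |providers_page|
    providers_page.search(selectors[:courses_link]).each do |more_link|
      first_courses_page = agent.get(more_link['href'], [], providers_page)
      each_page(agent, first_courses_page) do |courses_page|
        courses_page.search(selectors[:course_link]).each do |course_link|
          course_page = agent.get(course_link['href'], [], courses_page)
          entry_link = course_page.at(selectors[:entry_link])
          next unless entry_link
          yield course_link.text.strip, agent.get(entry_link['href'], [], course_page)
        end
      end
    end
  end
end

# Possible usage with Mechanize (not run here; the first selector is a guess):
#
#   require 'mechanize'
#   selectors = {
#     courses_link: 'div.morecourseslink a',
#     course_link:  'div.coursenamearea a',
#     entry_link:   'a#coursedetailtab_entryreqs'
#   }
#   crawl(Mechanize.new, start_url, selectors) do |course_name, entry_page|
#     puts course_name
#   end
```

Yielding the entry-requirements page instead of scraping it inside crawl leaves the choice of what to extract from that page to the caller.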
