Scraping with Nokogiri in Rails



So I have this code in my index action that I'd like to move into a model; I'm just a bit confused about how to do it.

Original code

  def index
    urls = %w[http://cltampa.com/blogs/potlikker http://cltampa.com/blogs/artbreaker http://cltampa.com/blogs/politicalanimals http://cltampa.com/blogs/earbuds http://cltampa.com/blogs/dailyloaf http://cltampa.com/blogs/bedpost]
    @final_images = []
    @final_urls = []
    
    urls.each do |url|
      blog = Nokogiri::HTML(open(url)) 
      images = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
      images.each do |image|
        @final_images << image
      end
      
      story_path = blog.xpath('//*[@class="postTitle"]/a/@href')
      story_path.each do |path|
        @final_urls << path
      end
    end  
  end

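I tested this code in my model and it works well for a single URL; I'm just not sure how to integrate all the URLs the way the original code does.

New code

Model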

class Photocloud < ActiveRecord::Base
  attr_reader :url, :data
  def initialize(url)
    @url = url
  end
  def data
    @data ||= Nokogiri::HTML(open(url))
  end
  def get_elements(path)
    data.xpath(path)
  end
end

Controller

def index 
  @scraper = Photocloud.new('http://cltampa.com/blogs/artbreaker')
  @photos = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
  @story_urls = @scraper.get_elements('//*[@class="postBody"]/div[1]//img/@src')
end

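My main question is how to initialize multiple URLs and loop over them the way the original code does. I've tried different things, but I feel like I've hit a wall. I'll eventually need to save the results to the database, but I'd like to get this working first. Any help is much appreciated.

Updated controller (WIP)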

  def index
    start_urls = %w[http://cltampa.com/blogs/potlikker 
      http://cltampa.com/blogs/artbreaker 
      http://cltampa.com/blogs/politicalanimals 
      http://cltampa.com/blogs/earbuds 
      http://cltampa.com/blogs/dailyloaf 
      http://cltampa.com/blogs/bedpost]
    @scraper = Photocloud.new(start_urls)
    @images = 
    @paths = 
  end

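I need some help with this part...

It looks like you aren't persisting the scraped images and paths to the database, so Photocloud doesn't need to inherit from ActiveRecord::Base. It can be just a plain old Ruby object (PORO):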

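require 'open-uri' # provides URI.open for fetching pages over HTTP

class Photocloud
  attr_reader :start_urls
  attr_accessor :images, :paths

  def initialize(start_urls)
    @start_urls = start_urls
    @images = []
    @paths = []
  end

  # Fetch every start URL and collect its image sources and story links.
  def scrape
    start_urls.each do |start_url|
      # Use the block variable here; on Ruby 3+ a bare open(url) no longer fetches URLs.
      blog = Nokogiri::HTML(URI.open(start_url))
      scrape_images(blog)
      scrape_paths(blog)
    end
  end

  private

  def scrape_images(blog)
    # Named `found` so the local variable does not shadow the `images` reader.
    found = blog.xpath('//*[@class="postBody"]/div[1]//img/@src')
    found.each do |image|
      images << image
    end
  end

  def scrape_paths(blog)
    story_paths = blog.xpath('//*[@class="postTitle"]/a/@href')
    story_paths.each do |path|
      paths << path
    end
  end
end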

In the controller:

scraper = Photocloud.new(start_urls)
scraper.scrape
@images = scraper.images
@paths = scraper.paths

Of course, this is only one of the possible ways to structure the code.
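
A Ruby detail worth keeping in mind with collector methods like `scrape_images`: a local variable that shares its name with an attribute reader shadows the reader inside that method, so `<<` appends to the local value rather than to the instance's array. Here is a minimal sketch of the difference, using a hypothetical `Collector` class (not part of the code above) and plain arrays standing in for Nokogiri node sets:

```ruby
# Hypothetical class for illustration only; arrays stand in for xpath results.
class Collector
  attr_reader :images

  def initialize
    @images = []
  end

  # Pitfall: the local `images` hides the reader, so @images is never touched.
  def collect_shadowed(found)
    images = found.dup                      # local variable named like the reader
    found.each { |image| images << image }  # appends to the local copy only
  end

  # Safe: no local named `images`, so `images` resolves to the reader (@images).
  def collect(found)
    found.each { |image| images << image }
  end
end

collector = Collector.new
collector.collect_shadowed(%w[a.png b.png])
p collector.images # => []
collector.collect(%w[a.png b.png])
p collector.images # => ["a.png", "b.png"]
```

Giving the local a different name, or writing to `@images` directly, avoids the problem.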
