How to bypass network errors when web crawling with Ruby Mechanize



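I'm using Ruby Mechanize to crawl popular real estate websites for data. I use home addresses as keywords to scrape public data on Zillow, Redfin, etc. Basically, I'm trying to get past any HTTP and network errors. The following rescue clauses don't seem to do the job.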

def scrape_single(key_word)
    #setup agent
    agent = Mechanize.new{ |agent|
        agent.user_agent_alias = 'Mac Safari'
    }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE 
    agent.request_headers = { "Accept-Encoding" => ""}
    agent.follow_meta_refresh = true
    agent.keep_alive = false
    #page setup
    begin
      agent.get(@@search_engine) do |page|
        @@search_result = page.form('f') do |search|
          search.q = key_word
        end.submit
      end 
    rescue Timeout::Error
      puts "Timeout"
      retry
    rescue Net::HTTPGatewayTimeOut => e
      if e.response_code == '504' || '502'
        e.skip
        sleep 5
      end
    rescue Net::HTTPBadGateway  => e
      if e.response_code == '504' || '502'
        e.skip
        sleep 5
      end
    rescue Net::HTTPNotFound => e
      if e.response_code == '404'
        e.skip
        sleep 5
      end
    rescue Net::HTTPFatalError => e
      if e.response_code == '503'
        e.skip
      end
    rescue Mechanize::ResponseCodeError => e
      if e.response_code == '404'
        e.skip
        sleep 5
      elsif e.response_code == '502'
        e.skip
        sleep 5
      else
        retry
      end
    rescue Errno::ETIMEDOUT
      retry
    end
    return @@search_result      # returns Mechanize::Page
  end 

Below is an example of the error message I get when using an address in MA as the keyword.

/home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/ma/washington/306-werden-rd-unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)

The actual message you see when you enter the above URL in a browser is:

Cannot GET /ma/washington/306-werden-rd-unknown/home/134059623

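My goal is simply to ignore and skip sporadic errors and move on to the next keyword. I couldn't really find a working solution online, so any feedback would be greatly appreciated.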

If I understand correctly, the error raised is Mechanize::ResponseCodeError, and it clearly carries a 404 response_code. But in your script you don't rescue that 404 from Mechanize::ResponseCodeError.

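Rescuing the 404 response_code:

all_response_code = ['403', '404', '502']

begin
  # ... agent.get / form submit, as in the question ...
rescue Mechanize::ResponseCodeError => e
  # response_code is a String on the exception, so read it from e
  if all_response_code.include?(e.response_code)
    # Rescuing the error is what skips it; the exception has no skip method
    sleep 5
  else
    # Any other code is retried (note that this retry is unbounded)
    retry
  end
end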

Maybe if you add a condition for the 404 response_code, it will do the trick.

Edit: I changed the code so that it takes fewer lines.
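To make the "skip and move on to the next keyword" idea concrete, here is a minimal sketch with a bounded retry count. It is only an illustration under assumptions: `FakeResponseCodeError`, `fetch_or_skip`, `SKIPPABLE_CODES`, `MAX_ATTEMPTS`, and `FAKE_FETCH` are hypothetical names, and the fake error class stands in for Mechanize::ResponseCodeError so that the example runs without the gem or network access.

```ruby
require 'timeout'

# Hypothetical stand-in for Mechanize::ResponseCodeError, so this sketch runs
# without the gem. Like the real class, it exposes response_code as a String.
class FakeResponseCodeError < StandardError
  attr_reader :response_code

  def initialize(code)
    @response_code = code
    super("#{code} => unhandled response")
  end
end

SKIPPABLE_CODES = %w[403 404 502].freeze # codes we give up on immediately
MAX_ATTEMPTS = 3                         # upper bound so retry cannot loop forever

# Runs the block and returns its value, or nil when the keyword should be skipped.
def fetch_or_skip(error_class: FakeResponseCodeError, pause: 0)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue error_class => e
    return nil if SKIPPABLE_CODES.include?(e.response_code)
    retry if attempts < MAX_ATTEMPTS
    nil
  rescue Timeout::Error, Errno::ETIMEDOUT
    if attempts < MAX_ATTEMPTS
      sleep pause
      retry
    end
    nil
  end
end

# Simulated fetcher: 'missing' always returns 404, 'flaky' fails once with 500.
CALLS = Hash.new(0)
FAKE_FETCH = lambda do |key_word|
  CALLS[key_word] += 1
  raise FakeResponseCodeError.new('404') if key_word == 'missing'
  raise FakeResponseCodeError.new('500') if key_word == 'flaky' && CALLS[key_word] < 2
  "page for #{key_word}"
end

# Keywords whose fetch was skipped simply do not appear in the results.
RESULTS = %w[ok missing flaky].each_with_object({}) do |key_word, acc|
  page = fetch_or_skip { FAKE_FETCH.call(key_word) }
  acc[key_word] = page unless page.nil?
end
```

With the real gem, the same helper could presumably be called as `fetch_or_skip(error_class: Mechanize::ResponseCodeError, pause: 5) { scrape_single(key_word) }`, then `next` whenever it returns nil. Capping the attempts is a design choice here, so that a page that keeps failing cannot stall the whole crawl.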
