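I am using Ruby Mechanize for web scraping to pull data from popular real estate websites. I use home addresses as keywords to scrape public data from Zillow, Redfin, etc. I'm basically trying to get past any HTTP and network errors. The rescue clauses below don't seem to do the job.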
def scrape_single(key_word)
  #setup agent
  agent = Mechanize.new{ |agent|
    agent.user_agent_alias = 'Mac Safari'
  }
  agent.ignore_bad_chunking = true
  agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
  agent.request_headers = { "Accept-Encoding" => ""}
  agent.follow_meta_refresh = true
  agent.keep_alive = false
  #page setup
  begin
    agent.get(@@search_engine) do |page|
      @@search_result = page.form('f') do |search|
        search.q = key_word
      end.submit
    end
  rescue Timeout::Error
    puts "Timeout"
    retry
  rescue Net::HTTPGatewayTimeOut => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPBadGateway => e
    if e.response_code == '504' || '502'
      e.skip
      sleep 5
    end
  rescue Net::HTTPNotFound => e
    if e.response_code == '404'
      e.skip
      sleep 5
    end
  rescue Net::HTTPFatalError => e
    if e.response_code == '503'
      e.skip
    end
  rescue Mechanize::ResponseCodeError => e
    if e.response_code == '404'
      e.skip
      sleep 5
    elsif e.response_code == '502'
      e.skip
      sleep 5
    else
      retry
    end
  rescue Errno::ETIMEDOUT
    retry
  end
  return @@search_result # returns Mechanize::Page
end
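Below is an example of the error message I get when using an address in MA as the keyword.
/home/ec2-user/.gem/ruby/2.1/gems/mechanize-2.7.5/lib/mechanize/http/agent.rb:323:in `fetch': 404 => Net::HTTPNotFound for https://www.redfin.com/ma/washington/306-werden-rd-unknown/home/134059623 -- unhandled response (Mechanize::ResponseCodeError)
The actual message you see when you enter the above URL is:
Cannot GET /ma/washington/306-werden-rd-unknown/home/134059623
My goal is simply to ignore and skip sporadic errors and move on to the next keyword. I couldn't find a working solution online, so any feedback would be greatly appreciated.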
If I understand correctly, the error raised is Mechanize::ResponseCodeError, and it clearly carries a 404 response_code. But in your script, you don't rescue the 404 response_code from Mechanize::ResponseCodeError.
all_response_code = ['403', '404', '502']
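rescue Mechanize::ResponseCodeError => e
  # response_code is a String on the rescued error, e.g. "404"
  if all_response_code.include? e.response_code
    # nothing to call here: falling out of the rescue skips this keyword
    sleep 5
  else
    retry
  end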
Maybe if you add a condition for the 404 response_code, it will do the trick.
EDIT: I changed the code so that it has fewer lines.
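To make the skip-or-retry idea above concrete, here is a minimal sketch of a wrapper that skips the listed response codes and retries other transient failures a bounded number of times. The names fetch_or_skip, SKIP_CODES, max_retries and wait are hypothetical and not part of Mechanize. The sketch only assumes that the rescued error exposes response_code as a String, as Mechanize::ResponseCodeError does, so it is duck-typed and does not load the gem itself.

```ruby
require 'timeout'

# Hypothetical helper; the names and defaults are illustrative only.
SKIP_CODES = %w[403 404 502].freeze

# Runs the block and returns its value, or nil when the request is skipped.
def fetch_or_skip(skip_codes: SKIP_CODES, max_retries: 3, wait: 5)
  attempts = 0
  begin
    yield
  rescue StandardError => e
    # Errors such as Mechanize::ResponseCodeError respond to #response_code.
    code = e.response_code.to_s if e.respond_to?(:response_code)
    return nil if skip_codes.include?(code) # give up on this keyword

    # Retry only other HTTP status errors and timeouts; re-raise the rest.
    transient = code || e.is_a?(Timeout::Error) || e.is_a?(Errno::ETIMEDOUT)
    attempts += 1
    raise unless transient && attempts <= max_retries # bounded, no endless loop

    sleep wait
    retry
  end
end
```

A caller could then loop over the keywords, wrap each request in fetch_or_skip { ... }, and move on with next whenever the result is nil. Capping the retries is a deliberate choice here: a bare retry on a page that never recovers would loop forever.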