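I need to log into Jenkins from a scraper to collect some data, but net/https returns an incomplete page compared with the source Jenkins serves to a browser. Here are both sources:
net/https' HTML: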
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="refresh" content="1;url=/login?from=%2F">
<script>
window.location.replace('/login?from=%2F');
</script>
</head>
<body style="background-color:white; color:white;">Authentication required</body>
</html>
Nokogiri's XML:
=> #<Nokogiri::HTML::Document:0x1a11444 name="document" children=[#<Nokogiri::XML::DTD:0x1a109b8 name="html">,
#<Nokogiri::XML::Element:0x1a101ac name="html" children=[#<Nokogiri::XML::Element:0x2047ee4 name="head" children=[#<Nokogiri::XML::Element:0x2047d04 name="meta" attributes=[#<Nokogiri::XML::Attr:0x2047ca0 name="http-equiv" value="refresh">,
#<Nokogiri::XML::Attr:0x2047c8c name="content" value="1;url=/login?from=%2F">]>,
#<Nokogiri::XML::Element:0x2047660 name="script" children=[#<Nokogiri::XML::CDATA:0x2047480 "window.location.replace('/login?from=%2F');">]>]>,
#<Nokogiri::XML::Element:0x20471ec name="body" attributes=[#<Nokogiri::XML::Attr:0x2047188 name="style" value="background-color:white; color:white;">] children=[#<Nokogiri::XML::Text:0x2046d50 "Authentication required">]>]>]>
Jenkins' source:
<!DOCTYPE html>
<html>
<head resURL="/static/98ff49d3">
<title>Jenkins</title>
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/style.css" />
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/color.css" />
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/css/responsive-grid.css" />
<link rel="shortcut icon" type="image/vnd.microsoft.icon" href="/static/98ff49d3/favicon.ico" />
<script>
var isRunAsTest = false;
var rootURL = "";
var resURL = "/static/98ff49d3";
</script>
<script src="/static/98ff49d3/scripts/prototype.js" type="text/javascript"></script>
<script src="/static/98ff49d3/scripts/behavior.js" type="text/javascript"></script>
<script src='/adjuncts/98ff49d3/org/kohsuke/stapler/bind.js' type='text/javascript'></script>
<script src="/static/98ff49d3/scripts/yui/yahoo/yahoo-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/dom/dom-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/event/event-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/animation/animation-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/dragdrop/dragdrop-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/container/container-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/connection/connection-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/datasource/datasource-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/autocomplete/autocomplete-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/menu/menu-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/element/element-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/button/button-min.js"></script>
<script src="/static/98ff49d3/scripts/yui/storage/storage-min.js"></script>
<script src="/static/98ff49d3/scripts/hudson-behavior.js" type="text/javascript"></script>
<script src="/static/98ff49d3/scripts/sortable.js" type="text/javascript"></script>
<script>
crumb.init("", "");
</script>
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/container/assets/container.css" />
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/assets/skins/sam/skin.css" />
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/container/assets/skins/sam/container.css" />
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/button/assets/skins/sam/button.css" />
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/scripts/yui/menu/assets/skins/sam/menu.css" />
<meta name="ROBOTS" content="INDEX,NOFOLLOW" />
<script src="/static/98ff49d3/scripts/yui/cookie/cookie-min.js"></script>
<link rel="stylesheet" type="text/css" href="/static/98ff49d3/plugin/sectioned-view/sectioned-view.css" />
</head>
<body id="jenkins" data-version="jenkins-1.596.1" class="yui-skin-sam jenkins-1.596.1"><a href="#skip2content" class="skiplink">Skip to content</a>
<div id="page-head">
<div id="header">
<div class="logo">
<a id="jenkins-home-link" href="/">
<img id="jenkins-head-icon" alt="title" src="/static/98ff49d3/images/headshot.png" />
<img id="jenkins-name-icon" height="34" alt="title" width="139" src="/static/98ff49d3/images/title.png" />
</a>
</div>
<div class="login"> <a href="/login?from=%2F"><b>log in</b></a>
|
<a href="/signup"><b>sign up</b></a>
</div>
<div class="searchbox hidden-xs">
<form style="position:relative;" name="search" action="/search/" class="no-json" method="get">
<div id="search-box-minWidth"></div>
<div id="search-box-sizer"></div>
<div id="searchform">
<input id="search-box" placeholder="search" name="q" class="has-default-text" />
<a href="http://wiki.jenkins-ci.org/display/JENKINS/Search+Box">
<img style="width: 16px; height: 16px; " class="icon-help icon-sm" src="/static/98ff49d3/images/16x16/help.png" />
</a>
<div id="search-box-completion"></div>
<script>
createSearchBox("/search/");
</script>
</div>
</form>
</div>
</div>
<div id="breadcrumbBar">
<tr id="top-nav">
<td id="left-top-nav" colspan="2">
<link rel='stylesheet' href='/adjuncts/98ff49d3/lib/layout/breadcrumbs.css' type='text/css' />
<script src='/adjuncts/98ff49d3/lib/layout/breadcrumbs.js' type='text/javascript'></script>
<div class="top-sticker noedge">
<div class="top-sticker-inner">
<div id="right-top-nav"></div>
<ul id="breadcrumbs">
<li class="item"><a class="model-link inside" href="/">Jenkins</a>
</li>
<li class="children" href="/"></li>
</ul>
<div id="breadcrumb-menu-target"></div>
</div>
</div>
</td>
</tr>
</div>
</div>
<div id="page-body">
<div class="row">
<div id="side-panel">
<div id="side-panel-content"></div>
</div>
<div id="main-panel">
<div id="main-panel-content">
<a name="skip2content"></a>
<div style="margin: 2em;">
<form style="text-size:smaller" name="login" action="j_acegi_security_check" method="post">
<table>
<tr>
<td>User:</td>
<td>
<input type="text" name="j_username" id="j_username" />
</td>
</tr>
<tr>
<td>Password:</td>
<td>
<input type="password" name="j_password" />
</td>
</tr>
<tr>
<td align="right">
<input id="remember_me" type="checkbox" name="remember_me" />
</td>
<td>
<label for="remember_me">Remember me on this computer</label>
</td>
</tr>
</table>
<input name="from" value="/" type="hidden" />
<input name="Submit" value="log in" class="submit-button primary" type="submit" />
<script>
$('j_username').focus();
</script>
</form>
<div style="margin-top:2em"><a href="signup">Create an account</a> if you are not a member yet.</div>
</div>
</div>
</div>
</div>
</div>
<div id="footer-container" class="hidden-xs">
<div id="footer"><span class="page_generated">
Page generated:
May 5, 2015 1:09:35 PM</span><span class="rest_api"><a href="api/">REST API</a></span><span class="jenkins_ver"><a href="http://jenkins-ci.org/">Jenkins ver. 1.596.1</a></span>
<div id="l10n-dialog" class="dialog"></div>
<div id="l10n-footer" style="display:none; float:left">
<a href="#" onclick="return showTranslationDialog();">
<img src="/static/98ff49d3/plugin/translation/flags.png" />Help us localize this page
</a>
</div>
<script>
var footer = document.getElementById('l10n-footer');
var f = document.getElementById('footer');
f.insertBefore(footer, f.firstChild);
footer.style.display = "block";
var translation = {};
translation.bundles = "6CPNEARN8E/l4k/4nMQznROeAYoCO7auJUGWM6qMGBK2/ELamFqR7whqOnrQ+pYEU4X6xVw11/3WEM16VclDS66Hi2QY5S41H0NSwFiE07KHND+iP3c2Zb4MiiqIOrGRLMJEPdu/j3QYQ5Yp2rkj/ISZWOGFVY86zs/0JsDEw+VJN9dlaSkRcelDKNfziTE/8K7Sabhhd0we7ATzNTgNrfenUCaCdwR7BqPc7354m+fmVz7/8DpcYBMzl78E3+DpUF6sJa18uD7OkgPMNYz8lIM9Bx1ZXanyOk49M8Sea9qj+teMndv9kiyawWnloiBlg3KdK0DfZs1v+RbCQ/HnYcIcjAZVgKTYD2S0GpSj5oHMFQeTemQRnbj6WMon3u7Z8q3np+0Ucgxcs1LfKqprNmeugoD5jIxCuHhHCQvaHdw=";
translation.detectedLocale = "";
function showTranslationDialog() {
if (!translation.launchDialog)
loadScript("/static/98ff49d3/plugin/translation/dialog.js");
else
translation.launchDialog();
return false;
}
</script>
</div>
</div>
</body>
</html>
I need these lines from the Jenkins source in order to fill in the form and log in:
<input type="text" name="j_username" id="j_username" />
<input type="password" name="j_password" />
<input name="Submit" value="log in" class="submit-button primary" type="submit" />
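For reference, these inputs sit inside the `login` form in the Jenkins source above, which posts to `j_acegi_security_check` and also carries a hidden `from` field. Below is a minimal sketch of the URL-encoded body such a submission would carry, using only Ruby's standard library. `login_form_body` is a hypothetical helper and the credentials are placeholders:

```ruby
require 'uri'

# Field names are taken from the login form in the Jenkins source above.
# login_form_body is a hypothetical helper, not part of Jenkins or Mechanize.
def login_form_body(username, password)
  URI.encode_www_form(
    "j_username" => username,
    "j_password" => password,
    "from"       => "/",       # hidden field: where to go after login
    "Submit"     => "log in"   # value of the submit button
  )
end

puts login_form_body("username-here", "password-here")
```

In practice a library such as Mechanize builds this body from the parsed form and keeps track of cookies, so the sketch is only meant to show what the form fields amount to on the wire.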
This is the code I am running to retrieve the page data:
require 'rubygems'
require 'nokogiri'
require 'net/https'
require 'openssl'
require 'mechanize'

class JenkinsTest
  # Request the Jenkins webpage
  def request_jenkins_webpage
    uri = URI.parse("https://jenkinspage.com:8443")
    http = Net::HTTP.new(uri.host, uri.port)
    http.use_ssl = true
    http.verify_mode = OpenSSL::SSL::VERIFY_NONE
    request = Net::HTTP::Get.new(uri.request_uri)
    response = http.request(request)
    @@page = Nokogiri::HTML(response.body)
  end

  def print_jenkins_webpage
    puts @@page
  end
end
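Some additional notes: the network sits behind a proxy that requires no login or password, and Jenkins' certificate is self-signed.
My question is: why does this happen, and how can I fix it?
Thanks in advance!
Thanks to help from @thetinman, @markthomas, and a colleague, I have managed to log into Jenkins and collect the page's XML using Mechanize and Nokogiri: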
require 'rubygems'
require 'nokogiri'
require 'net/https'
require 'openssl'
require 'mechanize'

# JenkinsXML logs into Jenkins and gets an XML version of the HTML page.

class JenkinsXML

  # Jenkins' URIs.
  @@jenkins_login_uri = "https://jenkinspage.com:8443/login?from=%2F"
  @@jenkins_page_uri = "https://jenkinspage.com:8443"

  # Log into Jenkins.
  def log_into_jenkins
    @@mechanize_agent = Mechanize.new
    @@mechanize_agent.user_agent_alias = "Windows IE 7"
    @@mechanize_agent.verify_mode = OpenSSL::SSL::VERIFY_NONE
    page = @@mechanize_agent.get(@@jenkins_login_uri)

    form = page.forms[1]

    form.j_username = "username-here"
    form.j_password = "password-here"
    @@mechanize_agent.submit(form)
  end

  # Get Jenkins' HTML.
  def get_jenkins_html
    @@jenkins_html = @@mechanize_agent.get(@@jenkins_page_uri).body
  end

  # Get Jenkins' XML.
  def get_jenkins_xml
    @jenkins_xml = Nokogiri::HTML(@@jenkins_html)
    return @jenkins_xml
  end

end
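
A note on why the original request came back short: the page that net/https returned in the question is a small stub whose only job is to send a browser on to `/login?from=%2F`, through a meta refresh tag and a `window.location.replace` call. Net::HTTP acts on neither, so it hands back the stub as-is. The code above sidesteps this by requesting the login URL directly. As a rough sketch of how the redirect target could be read out of such a stub, assuming a hypothetical helper named `meta_refresh_target` and using simple regular expressions in place of a real HTML parser:

```ruby
require 'uri'

# Returns the absolute URL named by a <meta http-equiv="refresh"> tag,
# or nil when the HTML contains no such tag.
# meta_refresh_target is a hypothetical helper; the regexes are a
# dependency-free simplification, not a general HTML parser.
def meta_refresh_target(html, base_url)
  tag = html[/<meta[^>]*http-equiv=["']?refresh["']?[^>]*>/i]
  return nil unless tag

  target = tag[/url=([^"'>\s]+)/i, 1]
  return nil unless target

  # Resolve a relative target such as "/login?from=%2F" against the base URL.
  URI.join(base_url, target).to_s
end

stub = <<~HTML
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <meta http-equiv="refresh" content="1;url=/login?from=%2F">
HTML

puts meta_refresh_target(stub, "https://jenkinspage.com:8443")
# prints https://jenkinspage.com:8443/login?from=%2F
```

The resolved URL is the same login address hard-coded in `@@jenkins_login_uri`, which is why fetching that address returns the full page with the login form.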