How to handle secure cookies with a web scraper

There are some tasks on a PHP site (running on nginx) that I'm trying to automate. I can log in, but subsequent requests to other parts of the site fail because I can't capture a set of cookies. When I grab the response headers, it's as if they don't exist. All I get are PHPSESSID and SERVERID; I'm missing the other five, even though I can see them in my browser's cookies. I believe only one of them is used as a persistent authentication token. I've tried JSoup and java URL in Java, and LWP/Mechanize in Perl. I should be able to get them, since Burp is written in Java.

http: REMOVED
POST /authenticate.php HTTP/1.1
Host: REMOVED
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.23)
Gecko/20110920 Firefox/3.6.23
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Proxy-Connection: keep-alive
Referer: REMOVED
Cookie: __utma=35782181.1596497020.1319574836.1319750878.1319821717.7; __utmv=35782181.|1=SignupDate=2011-OCT-24=1;uid="MTU5MTY4Ng==|1319649169|e4db70a9171742176a944f4fdc3613fd963b1b7e";username="dGVzdF9sb2dpbg==|1319649169|b82e24618b06d6b14d7ea64600c84a2d20c3de73"; defaultstat1=10; defaultstat3=10; SERVERID=ww4; PHPSESSID=53a7cd9acbb71ed7e7cc7be680e6c99c; __utmb=35782181.1.10.1319821717; __utmc=35782181; mode=full
Content-Type: application/x-www-form-urlencoded
Content-Length: 57
username=test_login&password=login123&btnLogin=Login

HTTP/1.0 302 Moved Temporarily
Server: nginx
Date: Fri, 28 Oct 2011 17:09:08 GMT
Content-Type: text/html
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: secret=99ba70c185973be0cd25e0f12dd1ea72; path=/
Location: REMOVED
X-Cache: MISS from REMOVED
Via: 1.0 REMOVED (http_scan/4.0.2.6.19)
Proxy-Connection: close

JSoup:

import java.util.Map;
import org.jsoup.Connection;
import org.jsoup.Jsoup;

Connection.Response res = Jsoup.connect(url)
        .data("username", username)
        .data("password", password)
        .data("btnLogin", "Login")
        .method(Connection.Method.POST)
        .execute();
Map<String, String> cookies = res.cookies();

cookies contains only PHPSESSID and SERVERID.
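One detail worth noting from the dump above: the secret cookie is set on the 302 response itself, so a client that only inspects headers after following the redirect can miss it. The helper below is a sketch of the minimal parsing needed to pull the name/value pair out of a raw Set-Cookie header (SetCookieParser is a hypothetical name, not part of JSoup):

```java
import java.util.AbstractMap;
import java.util.Map;

public class SetCookieParser {
    // Extract the cookie name/value pair from a raw Set-Cookie header,
    // e.g. "secret=99ba70c185973be0cd25e0f12dd1ea72; path=/",
    // discarding attributes such as path, domain, and expires.
    public static Map.Entry<String, String> parse(String header) {
        String pair = header.split(";", 2)[0].trim(); // keep only "name=value"
        int eq = pair.indexOf('=');
        return new AbstractMap.SimpleEntry<>(
                pair.substring(0, eq).trim(),
                pair.substring(eq + 1).trim());
    }

    public static void main(String[] args) {
        Map.Entry<String, String> c =
                parse("secret=99ba70c185973be0cd25e0f12dd1ea72; path=/");
        System.out.println(c.getKey() + " = " + c.getValue());
        // prints: secret = 99ba70c185973be0cd25e0f12dd1ea72
    }
}
```

With JSoup, calling followRedirects(false) on the Connection before execute() lets you look at the 302 response directly instead of the page it redirects to.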

The cookies in your example are Google's web-analytics cookies (the __utm* ones), and they are set via JavaScript. Unless the scraper you're writing can execute JavaScript, those cookies will never be set in the scraper.

What you see in your browser is completely irrelevant to fixing this; what matters is what the scraper sees, receives, and can do.
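If any of the client-side cookies do turn out to matter for the site, one workaround (a sketch; BrowserCookies and fromBrowser are hypothetical names, and which cookies actually matter is an assumption) is to copy them out of the browser and hand them to the scraper explicitly, for example via JSoup's Connection.cookies(Map):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BrowserCookies {
    // Cookie names and values copied from the browser request shown above;
    // a scraper that does not execute JavaScript has to supply these by
    // hand, because the site sets them client-side.
    public static Map<String, String> fromBrowser() {
        Map<String, String> cookies = new LinkedHashMap<>();
        cookies.put("defaultstat1", "10");
        cookies.put("defaultstat3", "10");
        cookies.put("mode", "full");
        return cookies;
    }

    public static void main(String[] args) {
        // With JSoup the map would be attached before the POST, e.g.:
        //   Jsoup.connect(url).cookies(fromBrowser()).method(Method.POST).execute();
        System.out.println(fromBrowser());
        // prints: {defaultstat1=10, defaultstat3=10, mode=full}
    }
}
```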
