Problem authenticating to an aspx page from within Python



There are a few related questions on here, but I haven't been able to solve my problem by reading their answers, so I thought I'd ask.

Basically, I'm trying to download some *.zip files from a website that requires a username and password. This is the site's login page:

http://data.theice.com/MyAccount/Login.aspx

Once logged in (in a normal browser session), I can follow the download links to get the *.zip files I need, for example:

http://data.theice.com/MyAccount/Download.aspx?PUID=41483&PDS=2&PRODID=3744&TS=2014

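So far I have tried the cookielib, urllib, urllib2 and HTMLParser libraries. I use HTMLParser to read the values of __VIEWSTATE and __EVENTVALIDATION, because I believe it is important to resubmit the same values in the form. However, when I try to open the login page with the correct login data, I just get the (unauthenticated) login page back. I really don't know what I'm doing wrong, and I'd appreciate any help.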

Thanks :)

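Note: I realize I've pasted a lot of code here. I did so only for completeness, but I'm fairly sure the code that fetches the __VIEWSTATE and __EVENTVALIDATION values returns the correct values.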

import cookielib
import urllib
import urllib2
from HTMLParser import HTMLParser
class IceConnection(object):
    def __init__(self, username, password):
        self.username = username
        self.password = password
        self.url = "http://data.theice.com/MyAccount/Login.aspx"
        self.headers = [
                    ('user-agent','Mozilla/5.0 (Windows NT 6.3; WOW64; rv:30.0) Gecko/20100101 Firefox/30.0'),
                    ('accept','text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
                    ('accept-language','en-US,en;q=0.5'),
                    ('accept-encoding','gzip, deflate'),
                    ('accept-charset','iso-8859-1,utf-8;q=0.7,*;q=0.7'),
                    ('connection','keep-alive'),
                    ('content-type','application/x-www-form-urlencoded')
        ]
        self.cookies = cookielib.CookieJar()
        self.opener = urllib2.build_opener(
            urllib2.HTTPRedirectHandler(),
            urllib2.HTTPHandler(debuglevel=0),
            urllib2.HTTPSHandler(debuglevel=0),
            urllib2.HTTPCookieProcessor(self.cookies)
        )
        self.opener.addheaders = self.headers
        #Extract view_state and event_validation variables:
        field_names = [r'__VIEWSTATE', r'__EVENTVALIDATION']
        field_values = self.extractFields(field_names)
        view_state = field_values[0]
        event_validation = field_values[1]
        self.fields = (
            (r'__EVENTTARGET', r''),
            (r'__EVENTARGUMENT', r''),
            (r'__LASTFOCUS',r''),
            (r'__VIEWSTATE', view_state),
            (r'__EVENTVALIDATION', event_validation),
            (r'ctl00$ContentPlaceHolder1$LoginControl$m_userName', username),
            (r'ctl00$ContentPlaceHolder1$LoginControl$m_password', password)
        )
        login_data = urllib.urlencode(self.fields)
        response = self.opener.open(self.url, login_data)
        print response.read()
    def extractFields(self, field_names):
        response = self.opener.open(self.url)
        html = ''.join(response.readlines())
        ret = list()
        for field in field_names:
            parser = PageParser(field)
            parser.feed(html)
            ret.append(parser.value)
        return ret
class PageParser(HTMLParser):
    def __init__(self, field_name):
        HTMLParser.__init__(self)
        self.field = field_name
        #Stays None if the field is not found on the page
        self.value = None
    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            #Create dictionary of attributes
            attributes = dict(attrs)
            if attributes.get('name') == self.field:
                self.value = attributes.get('value')
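
The extraction step above builds one parser per field and feeds it the whole page each time. As a rough sketch only, the hidden inputs could instead be collected in a single pass. The class name HiddenFieldParser and the sample HTML are made up for illustration and are not taken from the real login page; the import fallback is there so the sketch runs on both Python 2 and Python 3.

```python
try:
    from HTMLParser import HTMLParser  # Python 2, as in the code above
except ImportError:
    from html.parser import HTMLParser  # Python 3


class HiddenFieldParser(HTMLParser):
    """Hypothetical helper: collects every <input type="hidden"> name/value pair."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        # Attribute values can be None (e.g. a bare attribute), so guard before lower()
        input_type = (attributes.get('type') or '').lower()
        if input_type == 'hidden' and 'name' in attributes:
            self.fields[attributes['name']] = attributes.get('value') or ''


# Made-up sample markup standing in for the login page
sample_html = (
    '<form>'
    '<input type="hidden" name="__VIEWSTATE" value="abc123" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="xyz789" />'
    '<input type="text" name="user" value="ignored" />'
    '</form>'
)

parser = HiddenFieldParser()
parser.feed(sample_html)
hidden_fields = parser.fields
```

With this shape, `hidden_fields['__VIEWSTATE']` and `hidden_fields['__EVENTVALIDATION']` would take the place of the two separate lookups, and any other hidden field the form expects is picked up as well.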

I actually managed to solve my problem by using my browser (Google Chrome) to inspect the POST request sent to the server. In the form data I noticed this line:

__EVENTTARGET:ctl00$ContentPlaceHolder1$LoginControl$LoginButton

So I replaced the empty string for __EVENTTARGET in my code with the value above, and now it works!
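
For reference, here is a minimal sketch of what the corrected form payload might look like. The helper name build_login_payload and the LOGIN_BUTTON constant are made up for illustration; the field names are the ones from the question, and the placeholder arguments stand in for the values scraped from the login page. The import fallback lets the sketch run on both Python 2 and Python 3.

```python
try:
    from urllib import urlencode  # Python 2, as in the question
except ImportError:
    from urllib.parse import urlencode  # Python 3

# Value observed in the browser's POST data for __EVENTTARGET
LOGIN_BUTTON = 'ctl00$ContentPlaceHolder1$LoginControl$LoginButton'


def build_login_payload(view_state, event_validation, username, password):
    """Hypothetical helper: URL-encode the login form with __EVENTTARGET set."""
    fields = (
        ('__EVENTTARGET', LOGIN_BUTTON),  # was an empty string in the question
        ('__EVENTARGUMENT', ''),
        ('__LASTFOCUS', ''),
        ('__VIEWSTATE', view_state),
        ('__EVENTVALIDATION', event_validation),
        ('ctl00$ContentPlaceHolder1$LoginControl$m_userName', username),
        ('ctl00$ContentPlaceHolder1$LoginControl$m_password', password),
    )
    return urlencode(fields)


# Placeholder values only; real ones come from the login page and the user
payload = build_login_payload('vs', 'ev', 'user', 'secret')
```

The likely reason this matters is that an ASP.NET postback uses __EVENTTARGET to tell the server which control triggered the submit, so with an empty value the server presumably just re-renders the login page instead of running the login button's handler.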