Timeout problem when running a Python script with PhantomJS and Selenium



I am running a Python script with PhantomJS and Selenium, and I am hitting a timeout problem: the script stops after 20-50 minutes. I need a solution so that the script can run without the timeout problem. Where is the problem, and how can I solve it?

    The input file cannot be read or not in proper format.
    Traceback (most recent call last):
      File "links_crawler.py", line 147, in <module>
        crawler.Run()
      File "links_crawler.py", line 71, in Run
        self.checkForNextPages()
      File "links_crawler.py", line 104, in checkForNextPages
        self.next.click()
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 75, in click
        self._execute(Command.CLICK_ELEMENT)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webelement.py", line 454, in _execute
        return self._parent.execute(command, params)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/webdriver.py", line 199, in execute
        response = self.command_executor.execute(driver_command, params)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 395, in execute
        return self._request(command_info[0], url, body=data)
      File "/home/dev/.local/lib/python2.7/site-packages/selenium/webdriver/remote/remote_connection.py", line 463, in _request
        resp = opener.open(request, timeout=self._timeout)
      File "/usr/lib/python2.7/urllib2.py", line 431, in open
        response = self._open(req, data)
      File "/usr/lib/python2.7/urllib2.py", line 449, in _open
        '_open', req)
      File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
        result = func(*args)
      File "/usr/lib/python2.7/urllib2.py", line 1227, in http_open
        return self.do_open(httplib.HTTPConnection, req)
      File "/usr/lib/python2.7/urllib2.py", line 1200, in do_open
        r = h.getresponse(buffering=True)
      File "/usr/lib/python2.7/httplib.py", line 1127, in getresponse
        response.begin()
      File "/usr/lib/python2.7/httplib.py", line 453, in begin
        version, status, reason = self._read_status()
      File "/usr/lib/python2.7/httplib.py", line 417, in _read_status
        raise BadStatusLine(line)
    httplib.BadStatusLine: ''

Code:

import os
import re

from selenium import webdriver


class Crawler():
    def __init__(self, where_to_save, verbose=0):
        self.link_to_explore = ''
        self.TAG_RE = re.compile(r'<[^>]+>')
        # (?s) lets "." match newlines; the back-reference must be \1, not 1
        self.TAG_SCRIPT = re.compile(r'(?s)<(script).*?</\1>')
        if verbose == 1:
            self.driver = webdriver.Firefox()
        else:
            self.driver = webdriver.PhantomJS()
        self.links = []
        self.next = True
        self.where_to_save = where_to_save
        self.logs = self.where_to_save + "/logs"
        self.outputs = self.where_to_save + "/outputs"
        self.logfile = ''
        self.rnd = 0
        # Create the log and output directories if they do not exist yet
        if not os.path.isdir(self.logs):
            os.makedirs(self.logs)
        if not os.path.isdir(self.outputs):
            os.makedirs(self.outputs)
try:
    fin = open(file_to_read,"r")
    FileContent = fin.read()
    fin.close()
    crawler = Crawler(where_to_save)
    data = FileContent.split("\n")  # one "link | category" record per line
    for info in data:
        if info != "":
            to_process = info.split("|")
            link = to_process[0].strip()
            category = to_process[1].strip().replace(' ', '_')
            print "Processing the link: " + link + " : " + info
            crawler.Init(link,category)
            crawler.Run()
            crawler.End()
    crawler.closeSpider()
except:
    print "The input file cannot be read or no in proper format."
    raise

If you don't want the timeout to stop the script, you can catch the selenium.common.exceptions.TimeoutException and pass.
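A minimal sketch of that catch-and-pass idea (the PhantomJS driver setup and the URL here are placeholders of mine, not code from the question):

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

driver = webdriver.PhantomJS()
try:
    driver.get("http://example.com")  # any call that may time out
except TimeoutException:
    pass  # swallow the timeout instead of letting it stop the script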

You can set the default page-load timeout with the webdriver's set_page_load_timeout() method.

Like this:

driver.set_page_load_timeout(10)

This will raise a TimeoutException if your page doesn't load within 10 seconds.

EDIT: Forgot to mention that you'll have to put your code in a loop.

Add the import:

from selenium.common.exceptions import TimeoutException

while True:
    try:
        # Your code here
        break # Loop will exit
    except TimeoutException:
        pass
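
Putting the pieces together, here is a minimal self-contained sketch; the load_page_with_retry() helper, the retry count, and the URL are my own illustration, not part of the question's code:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException

def load_page_with_retry(driver, url, max_retries=3):
    """Try to load url, retrying on timeout; returns True on success."""
    for attempt in range(max_retries):
        try:
            driver.get(url)
            return True   # page loaded within the timeout
        except TimeoutException:
            pass          # timed out: go around and try again
    return False          # gave up after max_retries attempts

driver = webdriver.PhantomJS()
driver.set_page_load_timeout(10)  # raise TimeoutException after 10 seconds
if load_page_with_retry(driver, "http://example.com"):
    print driver.title
driver.quit()

Bounding the retries avoids the infinite loop a bare while True would give you if the page never loads.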
