How to iterate over links when the total number of pages is unknown



I want to get all the application links from every page. The problem is that the total number of pages is not the same in every category. I have this code:

import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/mp3_audio/'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)
# Collect the category links from the left-hand menu
for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
    print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    # Hard-coded page count: only correct for categories with exactly 27 pages
    for page in range(1, 27 + 1):
        content = urllib.urlopen(suburl + '{}.html'.format(page))
        soup = BeautifulSoup(content)
        for a in soup.select('div.freeText dl a[href]'):
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')

But I only get the application links for 27 pages in each category. What if another category has fewer, or more, than 27 pages?

You can extract the total number of programs and divide it by 20 (each page lists 20 programs). For example, if you open the URL http://www.brothersoft.com/windows/photo_image/font_tools/2.html:

import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/font_tools/2.html'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)
# The pagination menu contains text like "1-20 of 356"
pages = soup.find("div", {"class": "freemenu coLeft Menubox"})
page = pages.text
print int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1

The output is:

18

For the URL http://www.brothersoft.com/windows/photo_image/cad_software/6.html the output will be 108.
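The extraction and the arithmetic can be isolated in a small helper, which makes them easy to test without hitting the site. This is a sketch: `page_count` and `per_page` are hypothetical names, the menu text is assumed to contain `of N`, and it uses ceiling division, which matches `N / 20 + 1` except when `N` is an exact multiple of 20 (where it avoids one trailing empty page):

```python
import re

def page_count(menu_text, per_page=20):
    """Parse 'of N' from the pagination text and compute the number of pages (hypothetical helper)."""
    match = re.search(r'of (\d+)', menu_text)
    if not match:
        return 1  # no counter found; assume a single page
    total = int(match.group(1))
    # Ceiling division: one extra page only if there is a partial last page
    return (total + per_page - 1) // per_page

print(page_count("Free Font Tools: 1-20 of 356 results"))  # 18, as in the example above
```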

So you need to open one page of each category where the page count can be found, extract that number, and then run your loop. It could look like this:

import re
import urllib
from bs4 import BeautifulSoup

url = 'http://www.brothersoft.com/windows/photo_image/'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)
for a in soup.select('div.coLeft.cate.mBottom dd a[href]'):
    suburl = 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
    print suburl
    # Open one page of the category to read the total program count
    content = urllib.urlopen(suburl + '2.html')
    soup1 = BeautifulSoup(content)
    pages = soup1.find("div", {"class": "freemenu coLeft Menubox"})
    page = pages.text
    allPages = int(re.search(r'of (\d+) ', page).group(1)) / 20 + 1
    print allPages
    for page in range(1, allPages + 1):
        content = urllib.urlopen(suburl + '{}.html'.format(page))
        soup = BeautifulSoup(content)
        for a in soup.select('div.freeText dl a[href]'):
            print 'http://www.brothersoft.com' + a['href'].encode('utf-8', 'replace')
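As a side note, prefixing `'http://www.brothersoft.com'` by string concatenation only works while every `href` is root-relative; the standard library's `urljoin` handles relative and already-absolute hrefs alike. A minimal sketch (Python 3 import shown; in Python 2 the same function lives in the `urlparse` module):

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base = 'http://www.brothersoft.com/windows/mp3_audio/'

# A root-relative href, as in the category menu
print(urljoin(base, '/windows/photo_image/'))
# An already-absolute href is passed through unchanged
print(urljoin(base, 'http://www.brothersoft.com/windows/mp3_audio/2.html'))
```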
