How do I follow links (or scrape multiple pages) when web scraping with urllib2?



I'm trying to scrape the URL 'http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Knife&appid=730#p1' (just for reference), but I can't figure out how to get to the next page. My current code is below, but it just loops over the first page again and again instead of moving on to the next one.

import urllib2
from bs4 import BeautifulSoup
page_num = 1
while True:
    url = 'http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Knife&appid=730#p' + str(page_num)
    open_url = urllib2.urlopen(url).read()
    market_page = BeautifulSoup(open_url)
    for i in market_page('div', {'class' : 'market_listing_row market_recent_listing_row market_listing_searchresult'}):
        item_name = i.find_all('span', {'class' : 'market_listing_item_name'})[0].get_text()
        price = i.find_all('span')[1].get_text()
        print item_name + ' costs ' + price
    page_num += 1
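The likely culprit in both versions: everything after `#` in a URL is a fragment identifier. A browser keeps the fragment client-side (the page's JavaScript uses it to swap listings in place); it is never sent to the server. So `...&appid=730#p1` and `...&appid=730#p2` are, from the server's point of view, the exact same request, and urllib2 keeps getting page 1 back. This is easy to verify with the standard library (shown here with Python 3's `urllib.parse`; on Python 2 the same function lives in the `urlparse` module):

```python
from urllib.parse import urlsplit  # Python 2: from urlparse import urlsplit

base = ('http://steamcommunity.com/market/search?q='
        '&category_730_Type%5B%5D=tag_CSGO_Type_Knife&appid=730')

parts_p1 = urlsplit(base + '#p1')
parts_p2 = urlsplit(base + '#p2')

# The fragment differs...
print(parts_p1.fragment)  # 'p1'
print(parts_p2.fragment)  # 'p2'

# ...but the parts that are actually sent to the server are identical:
print(parts_p1._replace(fragment='') == parts_p2._replace(fragment=''))  # True
```

So incrementing the number after `#p` can never change what the server returns; the page number has to travel in the query string (or in whatever request the page's JavaScript makes behind the scenes).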

Edit: The added difficulty with the page I'm trying to scrape is that the link to the next page doesn't have an href at all, so I used a loop to try visiting different URLs instead, but it just scrapes the first URL repeatedly.

import urllib2
from bs4 import BeautifulSoup
pages  = 90
for page in range(pages):
    url = 'http://steamcommunity.com/market/search?q=&category_730_ItemSet%5B%5D=any&category_730_TournamentTeam%5B%5D=any&category_730_Weapon%5B%5D=any&category_730_Type%5B%5D=tag_CSGO_Type_Knife&appid=730#p' + str(page)
    open_url = urllib2.urlopen(url).read()
    market_page = BeautifulSoup(open_url)
    for i in market_page('div', {'class' : 'market_listing_row market_recent_listing_row market_listing_searchresult'}):
        item_name = i.find_all('span', {'class' : 'market_listing_item_name'})[0].get_text()
        price = i.find_all('span')[1].get_text()
        print item_name + ' costs ' + price
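Since the pagination is driven by JavaScript (which is why the "next page" arrow has no href), one workaround is to request the JSON endpoint the page itself calls, `http://steamcommunity.com/market/search/render/`, which takes real `start`/`count` query parameters instead of a fragment and returns the listing markup in a `results_html` field. The parameter names and response shape below are assumptions taken from watching the browser's network traffic, not a documented API, and the sample row markup is hand-made for illustration. A sketch (Python 3's `urllib.request.urlopen`; substitute `urllib2.urlopen` on Python 2):

```python
import json
from urllib.request import urlopen  # Python 2: from urllib2 import urlopen

from bs4 import BeautifulSoup

# Endpoint the search page itself polls (observed in browser dev tools,
# not an officially documented API).
RENDER_URL = ('http://steamcommunity.com/market/search/render/'
              '?query=&appid=730'
              '&category_730_Type%5B%5D=tag_CSGO_Type_Knife'
              '&start={start}&count={count}')

def page_url(page_num, page_size=10):
    # start/count live in the query string, so the server actually sees them.
    return RENDER_URL.format(start=page_num * page_size, count=page_size)

def parse_listings(results_html):
    """Pull (name, price) pairs out of an HTML fragment of listing rows."""
    soup = BeautifulSoup(results_html, 'html.parser')
    items = []
    for row in soup.find_all('div', {'class': 'market_listing_searchresult'}):
        name = row.find('span', {'class': 'market_listing_item_name'}).get_text()
        price = row.find_all('span')[1].get_text().strip()
        items.append((name, price))
    return items

def scrape_page(page_num):
    # Assumed response shape: {"success": true, "results_html": "...", ...}
    reply = json.loads(urlopen(page_url(page_num)).read().decode('utf-8'))
    return parse_listings(reply['results_html'])

# Offline demo with a hand-made fragment shaped like one listing row:
sample = ('<div class="market_listing_row market_listing_searchresult">'
          '<span>Starting at:</span><span>$107.50</span>'
          '<span class="market_listing_item_name">Karambit | Fade</span></div>')
print(page_url(1))             # ends with ...&start=10&count=10
print(parse_listings(sample))  # [('Karambit | Fade', '$107.50')]
```

Calling `scrape_page(0)` and then `scrape_page(1)` should print different items, which confirms the server is honouring the paging parameters, unlike the `#p` fragment.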
