For loop to scrape TripAdvisor restaurant data



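I am trying to scrape a list of all restaurants in Hong Kong along with their URLs. With the code below I can currently scrape the first and second pages. However, I would like the for loop at the bottom to be more dynamic, so that it keeps scraping until it reaches the number of entries I specify in range().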

I'm still new to this, so any help would be great.

#import libraries
import requests
from bs4 import BeautifulSoup
import csv

#scrape the first page separately because its URL is different than the URLs of the following pages
url0 = 'https://www.tripadvisor.com/Restaurants-g294217-Hong_Kong.html#EATERY_LIST_CONTENTS'
r = requests.get(url0)
data = r.text
soup = BeautifulSoup(r.text, "html.parser")
for link in soup.findAll('a', {'property_title'}):
    print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
    print link.string
#loop to move into the next pages. entries are in increments of 30 per page
for i in range(0, 120, 30):
    entries = str(30)
    #url format offsets the restaurants in increments of 30 after the oa; hence entries as variable
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + entries + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', {'property_title'}):
        print 'https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href')
        print link.string
    break

I ended up adding a while to make it loop the way I wanted. Hope this helps someone in the future.

max_entries = 120  #stop once the offset reaches this many entries
i = 30             #the first page (offset 0) is scraped separately above
while i < max_entries:
    #url format offsets the restaurants in increments of 30 after the oa; hence the loop variable as offset
    url1 = 'https://www.tripadvisor.com/Restaurants-g294217-oa' + str(i) + '-Hong_Kong.html#EATERY_LIST_CONTENTS'
    r1 = requests.get(url1)
    data1 = r1.text
    soup1 = BeautifulSoup(data1, "html.parser")
    for link in soup1.findAll('a', class_='property_title'):
        print('https://www.tripadvisor.com/Restaurant_Review-g294217-' + link.get('href'))
        print(link.string)
    i += 30  #move on to the next page of 30 entries
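
The paging logic can also be kept apart from the scraping itself, which makes the "stop after N entries" part easier to check. The following is only a rough sketch, not a tested scraper: `page_urls`, `collect` and the `fetch_page` callable are hypothetical names introduced for illustration, and the URL pattern and the page size of 30 are taken from the question as-is, so they may no longer match the live site.

```python
BASE = 'https://www.tripadvisor.com/Restaurants-g294217'
SUFFIX = '-Hong_Kong.html#EATERY_LIST_CONTENTS'
PAGE_SIZE = 30  # entries per listing page, as described in the question


def page_urls(max_entries, page_size=PAGE_SIZE):
    """Return the listing URLs needed to cover max_entries restaurants."""
    urls = []
    for offset in range(0, max_entries, page_size):
        if offset == 0:
            # The first page has no "-oa<offset>" segment in its URL.
            urls.append(BASE + SUFFIX)
        else:
            urls.append(BASE + '-oa' + str(offset) + SUFFIX)
    return urls


def collect(max_entries, fetch_page, page_size=PAGE_SIZE):
    """Gather up to max_entries rows.

    fetch_page is a hypothetical callable: it takes a listing URL and
    returns a list of rows (for example (name, href) tuples) for that page.
    """
    results = []
    for url in page_urls(max_entries, page_size):
        rows = fetch_page(url)
        if not rows:
            # An empty page is treated as the end of the listing.
            break
        results.extend(rows)
    # The last page may overshoot the requested count, so trim it.
    return results[:max_entries]
```

In this sketch, `fetch_page` would wrap the `requests.get` and `BeautifulSoup` calls from the loop above and return the links it finds. Calling `page_urls(120)` gives four URLs: the first page plus the offsets 30, 60 and 90.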
