Inconsistent results when scraping a web page with Beautiful Soup



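I've run into an inconsistency that is driving me crazy. I'm trying to scrape data about rental units. Say we have a web page with 42 ads: the code works for only 19 of them and then returns: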

Traceback (most recent call last):
  File "main.py", line 53, in <module>
    title = real_state_title.div.h1.text.strip()
AttributeError: 'NoneType' object has no attribute 'div'

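If you start the code from a different ad number, say 5, it again processes 19 ads and then raises the same error!

Below is the minimal code that reproduces the problem. Note that it prints the HTML of one ad that works and of one that fails. The two printouts are very different.

Run the code, then change the value of i to see the results.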

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')
print_functioning_ad = True
# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i += 1
    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    single_ad = uReq(ad_link)
    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()
    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
    # Print one functioning ad html
    if print_functioning_ad:
        print_functioning_ad = False
        print(page_soup2)
    print('real state title type', type(real_state_title))
    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except Exception:
        # Print the traceback and the HTML of the failing ad, then stop
        print(traceback.format_exc())
        print(page_soup2)
        break
    print('____________________________________________________________')
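
The traceback above shows that `find` returned `None`, which is what Beautiful Soup does when no element matches. Below is a minimal sketch of a guard that walks the `div` then `h1` chain and returns `None` instead of raising. `extract_title` is a hypothetical helper name, not part of the original code; it relies only on attribute access, so it should accept the result of `page_soup2.find(...)`.

```python
def extract_title(title_container):
    """Return the stripped <h1> text, or None if any link in the chain is missing.

    title_container is assumed to be the result of page_soup2.find(...),
    which is None when the page does not contain the expected element.
    """
    node = title_container
    for attr in ("div", "h1"):
        if node is None:
            return None
        # On a Beautiful Soup tag, attribute access looks up the first child
        # tag with that name and yields None when there is no such tag.
        node = getattr(node, attr, None)
    if node is None:
        return None
    return node.text.strip()
```

In the loop, `title = extract_title(real_state_title)` followed by an `if title is None:` check could stand in for the try/except, which makes it easier to decide what to do with an ad whose page came back in an unexpected shape.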

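Edit 1:

In my simplified example, I want to loop over every ad on the linked page, open it, and get its title. In my actual code I collect not just the title but other information about the ad as well, so I need to load data from the link associated with each ad. My code actually does this, but for some unknown reason it only works for 19 ads, no matter which ad I start from. This is driving me nuts!

To get all pages from the URL, you can use the following example: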

import requests
from bs4 import BeautifulSoup

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
page = 1
while True:
    print("Page {}...".format(page))
    print("-" * 80)
    soup = BeautifulSoup(requests.get(page_url).content, "html.parser")
    # Print the title of every ad on the current results page
    for i, a in enumerate(soup.select("a.title"), 1):
        print(i, a.get_text(strip=True))
    # Follow the "Next" link until there is none
    next_url = soup.select_one('a[title="Next"]')
    if not next_url:
        break
    print()
    page += 1
    page_url = "https://www.kijiji.ca" + next_url["href"]

Prints:

Page 1...
--------------------------------------------------------------------------------
1 Spacious One Bedroom Apartment
2 3 Bedroom Quispamsis
3 Uptown-two-bedroom apartment for rent - all-inclusive
4 New Construction!! Large 2 Bedroom Executive Apt
5 LARGE 1 BEDROOM UPTOWN $850 HEAT INCLUDED AVAIABLE JULY 1
6 84 Wright St Apt 2
7 310 Woodward Ave (Brentwood Tower) Condo #1502
...
Page 5...
--------------------------------------------------------------------------------
1 U02 - CHFR - Cozy 1 Bedroom + Den - WEST SAINT JOHN
2 2+ Bedroom Historic Renovated Stainless Kitchen
3 2 Bedroom Apartment - 343 Prince Street West
4 2 Bedroom 5th Floor Loft Apartment in South End Saint John
5 Bay of Fundy view from luxury 5th floor 1 bedroom + den suite
6 Suites of The Atlantic - Renting for Fall 2021: 2 bedrooms
7 WOODWARD GARDENS//2 BR/$945 + LIGHTS//MAY//MILLIDGEVILLE//JULY
8 HEATED & SMOKE FREE - Bach & 1Bd Apt - 50% off 1st month's rent
9 Beautiful 2 bedroom apartment in Millidgeville
10 Spacious 2 bedroom in Uptown Saint John
11 3 bedroom apartment at Millidge Ave close to university ave
12 Big Beautiful 3 bedroom apt. in King Square
13 NEWER HARBOURVIEW SUITES UNFURNISHED OR FURNISHED /BLUE ROCK
14 Rented
15 Completely Renovated - 1 Bedroom Condo w/ small den Brentwood
16 1+1 Bedroom Apartment for rent for 2 persons
17 3 large bedroom apt. in King Street East Saint John,NB
18 Looking for a house
19 Harbour View 2 Bedroom Apartment
20 Newer Harbourview suites unfurnished or furnished /Blue Rock Ct
21 LOVELY 2 BEDROOM APARTMENT FOR LEASE 5 WOODHOLLOW PARK EAST SJ
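
The pagination loop above builds the next URL by concatenating the site root with the `href` of the "Next" link, which assumes the `href` is always a root-relative path. As a hedged alternative, the standard library's `urljoin` resolves both relative and absolute links against the current page. `next_page_url` is a hypothetical helper, and the `href` values below are made up for illustration.

```python
from urllib.parse import urljoin


def next_page_url(current_url, href):
    """Resolve the href of the "Next" link against the page it was found on."""
    return urljoin(current_url, href)


base = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ad=offering"
# Hypothetical href values, not taken from the real site
print(next_page_url(base, "/b-apartments-condos/saint-john/page-2/c37l80017"))
print(next_page_url(base, "https://www.kijiji.ca/some/absolute/link"))
```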

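I think I've solved the problem. My guess is that the site does not allow many requests in a short period of time, so I added a try/except statement that sleeps for 80 seconds when this error occurs, and that solved my problem!

You may want to change the sleep duration to a different value, depending on which website you are scraping.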
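
If the cause really is a request limit, one possible complement to sleeping after a failure is to space the requests out in the first place. The sketch below is an assumption-laden illustration, not what the original code does: `Throttle`, its `min_interval` parameter, and the injectable `clock` and `sleep` arguments are all hypothetical names, and a suitable interval would have to be found by experiment.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between successive wait() calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `throttle.wait()` just before each `uReq(ad_link)` would keep the requests at least `min_interval` seconds apart.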

Here is the modified code:

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client
import traceback
import time

page_url = "https://www.kijiji.ca/b-apartments-condos/saint-john/c37l80017?ll=45.273315%2C-66.063308&address=Saint+John%2C+NB&ad=offering&radius=20.0"
# opens the connection and downloads html page from url
uClient = uReq(page_url)
# parses html into a soup data structure to traverse html
page_soup = soup(uClient.read(), "html.parser")
uClient.close()
# finds each ad from Kijiji web page
containers = page_soup.findAll('div', {'class': 'clearfix'})
# Print the number of ads in this web page
print(f'Number of ads in this web page is {len(containers)}')
print_functioning_ad = True
# Loop through ads
i = 1  # change to start from a different ad (don't put zero)
for container in containers[i:]:
    print(f'Ad No.: {i}\n')
    i = i + 1
    # Get the link for this specific ad
    ad_link_container = container.find('div', {'class': 'title'})
    ad_link = 'https://kijiji.ca' + ad_link_container.a['href']
    print(ad_link)
    single_ad = uReq(ad_link)
    # parses html into a soup data structure to traverse html
    page_soup2 = soup(single_ad.read(), "html.parser")
    single_ad.close()
    # Title
    real_state_title = page_soup2.find('div', {'class': 'realEstateTitle-1440881021'})
    try:
        title = real_state_title.div.h1.text.strip()
        print(title)
    except AttributeError:
        # The expected title element was missing: log it, then pause
        print(traceback.format_exc())
        i = i - 1
        t = 80
        print(f'----------------------------Sleep for {t} seconds!')
        time.sleep(t)
        # Note: `continue` advances to the next container, so the ad that
        # failed is skipped rather than fetched again
        continue
    print('____________________________________________________________')
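
In the loop above, `continue` moves on to the next ad after sleeping, so the ad that failed is not fetched a second time. A minimal sketch of a helper that retries the same URL is shown here. `fetch_with_retry` and its parameters (`fetch`, `parse`, `retries`, `delay`, `sleep`) are hypothetical names introduced for illustration; `fetch` is assumed to download a page and `parse` to return `None` when the expected title is missing.

```python
import time


def fetch_with_retry(fetch, parse, url, retries=3, delay=80, sleep=time.sleep):
    """Fetch and parse url, sleeping and retrying while parse returns None.

    Returns the parsed value, or None if every attempt failed.
    """
    for attempt in range(1, retries + 1):
        result = parse(fetch(url))
        if result is not None:
            return result
        if attempt < retries:
            print(f"Attempt {attempt} failed for {url}; sleeping {delay} seconds")
            sleep(delay)
    return None
```

With the original code, `fetch` could wrap `uReq(ad_link)` plus `soup(...)`, and `parse` could wrap the title lookup; the 80-second default simply mirrors the value used above.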
