Extracting the p from a div class in Python to get the address



What the code currently does: it finds all the gym URLs and writes them to a CSV, like this:

https://www.lifetime.life/life-time-locations/al-vestavia-hills.html
https://www.lifetime.life/life-time-locations/az-biltmore.html

What I want it to do: I'm having trouble extracting the address from each URL. My attempt at the address part is in the 4th and 5th lines from the bottom of the code below. The exact error is:

gymrow.append(address_line1[0].text)
IndexError: list index out of range
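
(For context: select() returns a plain list, so when the selector matches nothing, indexing [0] raises exactly this error. A minimal demonstration, using bs4 here purely for illustration:)

from bs4 import BeautifulSoup

page = BeautifulSoup('<p>no address markup here</p>', 'html.parser')
matches = page.select('span.btn-icon-text')
print(matches)    # [] -- the selector matched nothing
matches[0]        # IndexError: list index out of range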

Code:

import csv
import time
import urllib2
import urlparse
import requests
import BeautifulSoup

initial_url = "https://www.lifetime.life"
request = urllib2.Request("https://www.lifetime.life/view-all-locations.html")
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)

with open('gyms2.csv', 'w') as gf:
    gymwriter = csv.writer(gf)
    for a in soup.findAll('a'):
        if '/life-time-locations/' in a['href']:
            gymurl1 = urlparse.urljoin(initial_url, a.get('href'))
            sitemap_content = requests.get(gymurl1).content
            gymrow = [gymurl1]
            address_line1 = soup.select('p[class~=small m-b-sm p-t-1] > span[class~=btn-icon-text]')
            gymrow.append(address_line1[0].text)
            print(gymrow)
            gymwriter.writerow(gymrow)
            time.sleep(3)

[Image: inspect-element view of the p class, the span class, and the address I'm trying to scrape]

Thanks!

You fetch the HTML of each subpage, but you never parse it into a soup, so you keep searching the main page. You need something like:

response = requests.get(gymurl)
sub_soup = BeautifulSoup(response.text)
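
(Side note, my addition rather than part of the original answer: bs4 emits a "No parser was explicitly specified" warning for calls like the one above; naming the parser silences it and makes the result reproducible across machines:)

sub_soup = BeautifulSoup(response.text, 'html.parser')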

I also had a problem with your CSS selector: [class~=...] matches one whitespace-separated class name at a time, so a value with spaces such as "small m-b-sm p-t-1" can never match. Chaining the classes with dots works:

address_line = sub_soup.select('p.small.m-b-sm.p-t-1 span.btn-icon-text')
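
A quick self-contained check of that selector (the HTML fragment below is made up to mirror the gym pages' markup):

from bs4 import BeautifulSoup

html = '''<p class="small m-b-sm p-t-1">
  <span class="btn-icon-text">123 Example Pkwy, Vestavia Hills, AL</span>
</p>'''

doc = BeautifulSoup(html, 'html.parser')
# p.small.m-b-sm.p-t-1 requires all three classes on the <p>;
# the space is a descendant combinator, not part of a class name.
print(doc.select('p.small.m-b-sm.p-t-1 span.btn-icon-text')[0].text.strip())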

Some pages have no element at that location and this raises an error, so I catch it with try/except.


Tested on Python 3, because .select() didn't work for me on Python 2 (the old BeautifulSoup 3 module doesn't implement it).

import requests
from bs4 import BeautifulSoup
import urllib.parse
import csv
import time

initial_url = "https://www.lifetime.life"
response = requests.get("https://www.lifetime.life/view-all-locations.html")
soup = BeautifulSoup(response.text)

with open('gyms2.csv', 'w') as gf:
    gymwriter = csv.writer(gf)
    for a in soup.findAll('a'):
        if '/life-time-locations/' in a['href']:
            gymurl = urllib.parse.urljoin(initial_url, a.get('href'))
            print(gymurl)
            response = requests.get(gymurl)
            sub_soup = BeautifulSoup(response.text)
            try:
                address_line = sub_soup.select('p.small.m-b-sm.p-t-1 span.btn-icon-text')
                gymrow = [gymurl, address_line[0].text.strip()]
                print(gymrow)
                gymwriter.writerow(gymrow)
                time.sleep(3)
            except Exception as ex:
                print(ex)

EDIT: for Python 2, use find() instead of select(). Note that find() returns None when nothing matches, so the chained .find('span', ...) raises AttributeError on pages without the address markup, which the same try/except catches:

import requests
import BeautifulSoup
import csv
import urllib2
import time

initial_url = "https://www.lifetime.life"
response = requests.get("https://www.lifetime.life/view-all-locations.html")
soup = BeautifulSoup.BeautifulSoup(response.text)

with open('gyms2.csv', 'w') as gf:
    gymwriter = csv.writer(gf)
    for a in soup.findAll('a'):
        if '/life-time-locations/' in a['href']:
            gymurl = urllib2.urlparse.urljoin(initial_url, a.get('href'))
            print(gymurl)
            response = requests.get(gymurl)
            sub_soup = BeautifulSoup.BeautifulSoup(response.text)
            try:
                address_line = sub_soup.find('p', {'class': 'small m-b-sm p-t-1'}).find('span', {'class': 'btn-icon-text'})
                gymrow = [gymurl, address_line.text]
                print(gymrow)
                gymwriter.writerow(gymrow)
                time.sleep(3)
            except Exception as ex:
                print(ex)

EDIT: there seem to be several versions of the page, so each layout may need its own try/except. Rather than nesting the second try/except inside the first except clause, I use continue when a try succeeds, which jumps back to the for loop and skips the remaining try/except blocks:

import requests
from bs4 import BeautifulSoup
import urllib.parse
import csv
import time

initial_url = "https://www.lifetime.life"
response = requests.get("https://www.lifetime.life/view-all-locations.html")
soup = BeautifulSoup(response.text)

with open('gyms2.csv', 'w') as gf:
    gymwriter = csv.writer(gf)
    for a in soup.findAll('a'):
        if '/life-time-locations/' in a['href']:
            gymurl = urllib.parse.urljoin(initial_url, a.get('href'))
            print(gymurl)
            response = requests.get(gymurl)
            sub_soup = BeautifulSoup(response.text)
            try:
                address_line = sub_soup.select('p.small.m-b-sm.p-t-1 span.btn-icon-text')
                gymrow = [gymurl, address_line[0].text.strip()]
                print('type 1:', gymrow)
                gymwriter.writerow(gymrow)
                time.sleep(3)
                continue  # go back to `for`
            except Exception as ex:
                print('ex:', ex)
            try:
                address_line = sub_soup.find('div', {'class': 'btn-resp-md'}).find('p')
                gymrow = [gymurl, address_line.text.strip()]
                print('type 2:', gymrow)
                gymwriter.writerow(gymrow)
                time.sleep(3)
                continue  # go back to `for`
            except Exception as ex:
                print('ex:', ex)
            try:
                address_line = sub_soup.find('p', {'class': 'm-b-grid'})
                gymrow = [gymurl, address_line.text.strip()]
                print('type 3:', gymrow)
                gymwriter.writerow(gymrow)
                time.sleep(3)
                continue  # go back to `for`
            except Exception as ex:
                print('ex:', ex)
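
Since the three branches differ only in how the address node is located, they could also be folded into one helper. This is just a sketch of that idea (the extract_address name and structure are mine; only the selectors come from the code above):

def extract_address(sub_soup):
    # Layout 1: <p class="small m-b-sm p-t-1"><span class="btn-icon-text">
    node = sub_soup.select_one('p.small.m-b-sm.p-t-1 span.btn-icon-text')
    if node is None:
        # Layout 2: first <p> inside <div class="btn-resp-md">
        div = sub_soup.find('div', {'class': 'btn-resp-md'})
        node = div.find('p') if div else None
    if node is None:
        # Layout 3: <p class="m-b-grid">
        node = sub_soup.find('p', {'class': 'm-b-grid'})
    return node.text.strip() if node else None

The loop body then shrinks to one call plus a None check, and no try/except is needed for the lookup itself.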
