BeautifulSoup scrapes extra data



I am using Python and BeautifulSoup to scrape some data from this website.

Code:

import requests
from bs4 import BeautifulSoup

eclipse_time_date = requests.get("https://www.timeanddate.com/eclipse/")
soup = BeautifulSoup(eclipse_time_date.text, 'html.parser')
# Each eclipse entry sits in a div with this exact class string.
eclipse_info = soup.find_all("div", class_="six columns art__eclipse-txt")
for info in eclipse_info:
    print("Eclipse Date: {0}".format(info.find('a').text))
    print("Location: {0}".format(info.find('p').text))

Output:

Eclipse Date: July 13, 2018 — Partial Solar Eclipse
Location: South in Australia, Pacific, Indian Ocean New Features: Path Map | 3D Path Globe | Eclipse Information
Eclipse Date: July 27, 2018 — Total Lunar Eclipse
Location: Much of Europe, Much of Asia, Australia, Africa, South in North America, South America, Pacific, Atlantic, Indian Ocean, Antarctica New Features: Path Map | 3D Path Globe | Eclipse Information

My problem is that the part after the location ("New Features:" and so on) is also inside the p tag. How can I ignore that part so that my output is:

Eclipse Date: July 13, 2018 — Partial Solar Eclipse
Location: South in Australia, Pacific, Indian Ocean
Eclipse Date: July 27, 2018 — Total Lunar Eclipse
Location: Much of Europe, Much of Asia, Australia, Africa, South in North America, South America, Pacific, Atlantic, Indian Ocean, Antarctica

I could use split() and find the index of "New", but some locations contain the word "New", for example "New York" or "New Orleans".
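For the string-based route, one possible sketch is to cut at the full marker "New Features:" rather than at "New" alone, so that place names containing "New" are left untouched. This assumes the marker text always appears exactly as in the output above; the helper name strip_new_features and the second sample string are made up for illustration.

```python
def strip_new_features(text, marker="New Features:"):
    """Return the part of text before marker, with surrounding whitespace removed.

    If the marker is absent, the whole text is returned (stripped).
    """
    location, _, _ = text.partition(marker)
    return location.strip()


# First sample is taken from the output above; the second is hypothetical.
samples = [
    "South in Australia, Pacific, Indian Ocean New Features: Path Map | 3D Path Globe | Eclipse Information",
    "New York, New Orleans New Features: Path Map",
]
for sample in samples:
    print("Location: {0}".format(strip_new_features(sample)))
```

This still depends on the wording of the page, so it would break if the marker text changed.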

I would like to know whether there is a way to extract just this data using BeautifulSoup?

You can locate the HTML data through the article tag and then build a more structured grouping:

import re
import requests
from bs4 import BeautifulSoup as soup

s = soup(requests.get('https://www.timeanddate.com/eclipse/').text, 'html.parser')
# Each eclipse is wrapped in its own <article> element.
groups = s.find_all('article', {'class': 'art__eclipse-nxt pdflexi'})
# For every article, take its first <figcaption>, <h3> and <p> tags.
new_groups = [[getattr(i, c) for c in ['figcaption', 'h3', 'p']] for i in groups]
for date, title, description in new_groups:
    print('Title: {}'.format(title.text))
    print('Date: {}'.format(date.text))
    # Drop the empty icon tag, then re-parse the paragraph to read its text.
    print('Description: {}'.format(soup(re.sub('<i class="i-font"></i>', '', str(description)), 'html.parser').find('p').text))
    print('-'*20)

Output:

Title: July 13, 2018 — Partial Solar Eclipse
Date: Jul 13, 2018
Description: South in Australia, Pacific, Indian Ocean New Features: 
Path Map | 3D Path Globe | Eclipse Information
--------------------
Title: July 27, 2018 — Total Lunar Eclipse
Date: Jul 27, 2018
Description: Much of Europe, Much of Asia, Australia, Africa, South in 
North America, South America, Pacific, Atlantic, Indian Ocean, 
Antarctica New Features: Path Map | 3D Path Globe | Eclipse Information
--------------------
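
The output above still carries the "New Features" text. If the location is the first piece of text directly inside the p tag and the "New Features" links sit in child tags, another option might be to read only that first direct text node. The sketch below runs on a small hand-written HTML sample, not on the live page: the markup in SAMPLE_HTML, the second location and the function name extract_eclipses are all assumptions made for illustration, so the real page may need a different selector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about the page structure, not a copy of it.
SAMPLE_HTML = """
<div class="six columns art__eclipse-txt">
  <h3><a href="#">July 13, 2018 - Partial Solar Eclipse</a></h3>
  <p>South in Australia, Pacific, Indian Ocean <span>New Features: <a href="#">Path Map</a> | <a href="#">3D Path Globe</a></span></p>
</div>
<div class="six columns art__eclipse-txt">
  <h3><a href="#">July 27, 2018 - Total Lunar Eclipse</a></h3>
  <p>Much of Europe, New York <span>New Features: <a href="#">Path Map</a></span></p>
</div>
"""


def extract_eclipses(html):
    """Return (title, location) pairs, keeping only the first direct text of each <p>."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for info in soup.find_all("div", class_="six columns art__eclipse-txt"):
        title = info.find("a").text.strip()
        # recursive=False limits the search to direct children of <p>,
        # so text nested inside child tags such as <span> is skipped.
        location = info.find("p").find(string=True, recursive=False)
        results.append((title, location.strip() if location else ""))
    return results


for title, location in extract_eclipses(SAMPLE_HTML):
    print("Eclipse Date: {0}".format(title))
    print("Location: {0}".format(location))
```

If the page instead keeps the location and the "New Features" label in the same text node, this will not separate them and a string-based cut would be needed.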
