I'm trying to use Beautiful Soup to scrape the official sites data from a title page on IMDb. For example, if I want the data for Interstellar, I have this code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt0816692/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

title_detail_soup = soup.find('div', {'id': 'titleDetails'})
details_soup = title_detail_soup.find_all('div', class_='txt-block')

detail_list = ['Official Sites:', 'Country:', 'Language:',
               'Release Date:', 'Also Known As:', 'Filming Locations:']

details = {}
for detail in details_soup:
    try:
        # Each txt-block starts with an <h4> heading naming the detail
        head = detail.find('h4')
        if head.get_text() in detail_list:
            # Only handle the headings in the detail list
            if head.get_text() == 'Official Sites:':
                official_site = {}
                detail.h4.decompose()  # remove the <h4> tag so only the links remain
                a_tags = detail.find_all('a')
                for a_tag in a_tags:
                    # exclude the "See more" link
                    if a_tag.get_text() != 'See more':
                        data = url + a_tag['href']  # final link is base URL + hyperlink
                        official_site[a_tag.get_text()] = data
                details['official-sites'] = official_site
    except Exception as e:
        print(e)

print(details)  # Print the details dictionary
The HTML of the page:
<div class="article" id="titleDetails">
<span class="rightcornerlink">
<a href="https://contribute.imdb.com/updates?edit=tt0816692/details&ref_=tt_dt_dt">Edit</a>
</span>
<h2>Details</h2>
<div class="txt-block">
<h4 class="inline">Official Sites:</h4>
<a href="/offsite/?page-action=offsite-facebook&token=BCYpckvEa_ZSPp2TC3Ztr1DNqde5ZCUHig7950CLYvsgSHOzBCfJSHpgg71IYRsZYP1DuUpTZb9H%0D%0AhK4BzY5AiKU5Vy2oFn7i91MVFT_TnR39yhU5V5NBAse2mY_ht5WdsmSBxQPGRBC6pIJJym7IXbao%0D%0ATz9SG3r8MjKfwIe9hBrJU5Y-vNdnR_uaDq_24s2NGj5ikJYWl_093YIHy_I2lnK-I6jK9OvOpwgw%0D%0AupABQOymuxA%0D%0A&ref_=tt_pdt_ofs_offsite_0" rel="nofollow">Official Facebook</a>
<span class="ghost">|</span>
<a href="/offsite/?page-action=offsite-interstellarmovie&token=BCYuB9Ouy5QXl_3W_k3RrnnXUdrfSLbBFfOcrJTX0yo5TtTDqsSLpry8x7drK8l0xpOJSEqt73Hz%0D%0A08qyki3_i83CrCym7SXSkevFQpT32TjuuJLgIlQ-W5CpRd-wZC9eD4R3SZOMdOfSjeoOtqiE5uU_%0D%0Az-YG1i5AImXY2xLmHSNwABh1hU7VHS-FnqKDW9G-4KOF78zpKdDIfrwlRs8px0yef9u51LojZz05%0D%0A0OBfTmRs_JI%0D%0A&ref_=tt_pdt_ofs_offsite_1" rel="nofollow">Official site</a>
<span class="ghost">|</span>
<span class="see-more inline">
<a href="externalsites?ref_=tt_dt_dt#official">See more</a> »
</span>
</div>
</div>
This successfully extracts the data into a dictionary, but when I follow the hyperlinks stored in the dictionary they don't work: requesting them gives a "requested URL not found" error.
Output dictionary:
{
'official-sites': {
'Official Facebook': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-facebook&token=BCYqzjQrP9OA_yaYNwA9Q8hI5gt41EmHuu0_ePjZPHKui-hEmAEySo-0SHzZmSjpeeEVy3Art6SH%0D%0ATseW16b3uKMjIH8iOyO-ZVYR025mQ4YCbZIWUKEcEM-z0eOeUvud3KGbuQTCxrNhTGAx7xgFIB89%0D%0Al9jT6pvqSpSCdNYACnBhk_8MuNjCn8GIJZk-6PR1MZ1xQB5yDrqRNhNt9Dg8IDMXVpxTR8-LFu2I%0D%0Amf5KmXbmXos%0D%0A',
'Official site': 'https://www.imdb.com/title/tt0816692/offsite/?page-action=offsite-interstellarmovie&token=BCYsMb9WTKJLH9M9nmxvLDpn8ikQDnQmpVQZBurp9Trd1-XXbA_Bh4xoKx6yf3Qx4YNn3fT9UhFe%0D%0AnzcULcEY5SFJ7CW8kBj6dQvZA9GyvqfZMyIDS7daNe6rne6DkdL23CDPAkk1Xwr9rjiE6FF_m0vX%0D%0ASLH2NnzOf8BcKnaWILhGGdvHTYeZ_uRGm4QCIOzxw-CvLM2rag04ZbXM2ZUEvQm6OedW9XumtsnQ%0D%0AoP7ce67sytE%0D%0A'
}
}
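I suspect the problem is how data = url + a_tag['href'] builds the link, since the hrefs on the page already start with a slash. For comparison only (urljoin is not used in my code above, this is just an illustration), urllib.parse.urljoin resolves a root-relative href against the host rather than against the title page:

from urllib.parse import urljoin

url = 'https://www.imdb.com/title/tt0816692/'
href = '/offsite/?page-action=offsite-facebook&token=...'  # href from the page, token truncated

print(url + href)          # path stays under /title/tt0816692/ -> what my code stores
print(urljoin(url, href))  # https://www.imdb.com/offsite/?... -> resolved against the site root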
For every place where you call get_text(), make sure the object is not None first.
Try this:
import requests
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/title/tt0816692/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

title_detail_soup = soup.find('div', {'id': 'titleDetails'})
headings_soup = title_detail_soup.find_all(['h2', 'h3'])
details_soup = title_detail_soup.find_all('div', class_='txt-block')

detail_list = ['Official Sites:', 'Country:', 'Language:',
               'Release Date:', 'Also Known As:', 'Filming Locations:']

details = {}
for detail in details_soup:
    try:
        head = detail.find('h4')
        if head.get_text() in detail_list:
            if head.get_text() == 'Official Sites:':
                official_site = {}
                detail.h4.decompose()
                a_tags = detail.find_all('a')
                for a_tag in a_tags:
                    if a_tag.get_text() != 'See more':
                        data = url + a_tag['href']
                        official_site[a_tag.text] = data
                details['official-sites'] = official_site
    except Exception as e:
        # a txt-block without an <h4> makes head.get_text() raise AttributeError here
        # print(e)
        pass

print(details)
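If you would rather make the get_text() guard explicit instead of relying on the try/except, a minimal sketch of the same loop (illustrative only, variable names as above):

for detail in details_soup:
    head = detail.find('h4')
    if head is None:  # txt-block without an <h4> heading, skip it
        continue
    if head.get_text() not in detail_list:
        continue
    # ... handle the 'Official Sites:' block as in the code above ...

Calling get_text() on None is what raises the AttributeError that the try/except otherwise swallows.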