Webscrape - getting the links/href



I'm trying to access a web page and get the href/link for each row.

Currently, the code just prints blanks.

The expected output is to print the href/link for each row on the web page.

import requests
from bs4 import BeautifulSoup
url = 'https://meetings.asco.org/meetings/2022-gastrointestinal-cancers-symposium/286/program-guide/search?q=&pageNumber=1&size=20'
baseurl='https://ash.confex.com/ash/2021/webprogram/'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')

productlist = soup.find_all('div',class_='session-card')
for b in productlist:
    links = b["href"]
    print(links)

What happens

First, take a closer look at your soup: you will not find the information you are searching for, because you are being blocked.

Also, the elements you select with find_all('div', class_='session-card') have no direct href attribute.

How to fix

Add some headers to your request:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
res = requests.get(url, headers=headers)

Additionally, select the nested <a> in your iteration to pick the link and get its href:

b.a["href"]
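To see why b["href"] prints nothing while b.a["href"] works, consider a minimal card structure (the markup below is an illustrative assumption, not the site's actual HTML):

```python
from bs4 import BeautifulSoup

# Illustrative markup: the <div> itself carries no href; the nested <a> does
html = '<div class="session-card"><a href="/session/123">Title</a></div>'
card = BeautifulSoup(html, 'html.parser').find('div', class_='session-card')

print(card.get('href'))  # None - the div has no href attribute
print(card.a['href'])    # /session/123 - the nested <a> does
```

Using card.get('href') instead of card['href'] avoids a KeyError when the attribute is missing.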

Example

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'https://meetings.asco.org/meetings/2022-gastrointestinal-cancers-symposium/286/program-guide/search?q=&pageNumber=1&size=20'
baseurl='https://ash.confex.com/ash/2021/webprogram/'

res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.content, 'html.parser')

# iterate over the session cards and read the href of the nested <a>
for b in soup.find_all('div', class_='session-card'):
    links = b.a["href"]
    print(links)
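The hrefs printed this way are typically relative paths. The unused baseurl variable in the question suggests the intent was to build absolute URLs (note it points at a different site than the ASCO page being scraped). A minimal sketch using the standard library's urljoin, with the page's own origin as the base (the sample relative path is an assumption for illustration):

```python
from urllib.parse import urljoin

# Hypothetical relative href as it might appear on a session card
relative_href = "/meetings/2022-gastrointestinal-cancers-symposium/286/session/1"

# Assumption: session-card links resolve against the page's own origin,
# not the unrelated baseurl from the question
base = "https://meetings.asco.org"
absolute = urljoin(base, relative_href)
print(absolute)
```

urljoin also handles hrefs that are already absolute, leaving them unchanged.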

Latest update