How do I get the data in the href attribute of an <a> tag using Beautiful Soup in Python?


import requests
from bs4 import BeautifulSoup

url = 'https://www.maritimecourier.com/restaurant'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/75.0.3770.80 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
# The selector is split across adjacent string literals so each line stays valid Python.
test = soup.select(
    '.underline-body-links .sqs-block a, '
    '.underline-body-links .entry-content a, '
    '.underline-body-links .eventlist-excerpt a, '
    '.underline-body-links .playlist-description a, '
    '.underline-body-links .image-description a, '
    '.underline-body-links .sqs-block a:visited, '
    '.underline-body-links .entry-content a:visited, '
    '.underline-body-links .eventlist-excerpt a:visited, '
    '.underline-body-links .playlist-description a:visited, '
    '.underline-body-links .image-description a:visited'
)
test
With this code, I got this output:
[<a href="https://www.instagram.com/breakfast_dreams/" target="_blank">Breakfast Dreams</a>,
<a href="https://www.maritimecourier.com/breakfast-dreams" target="_blank">MARITIME</a>,
<a href="https://www.instagram.com/latarantellalb/" target="_blank">La Tarantella</a>]

Now I am trying to get the URL and the name from each a tag, and I would like to know how to do that. So far, I have tried this:

results = []
for restaurant in soup.select('.underline-body-links .sqs-block a, .underline-body-links .entry-content a, .underline-body-links .eventlist-excerpt a, .underline-body-links .playlist-description a, .underline-body-links .image-description a, .underline-body-links .sqs-block a:visited, .underline-body-links .entry-content a:visited, .underline-body-links .eventlist-excerpt a:visited, .underline-body-links .playlist-description a:visited, .underline-body-links .image-description a:visited'):
    results.append({
        'title':restaurant.find('a',{'target':'_blank'}).text
    })
results

But I got this:

'NoneType' object has no attribute 'text'

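It is not very clear what you are selecting, or what the expected output is. The main problem is that your selector already returns the <a> elements, and you are then trying to find an <a> inside each <a>. There is no nested <a>, so find() returns None and accessing .text on it fails.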
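The failure can be reproduced on a single link. This is a minimal sketch: the one-line HTML string is an assumption built from the first element of the output shown in the question, not the real page.

```python
from bs4 import BeautifulSoup

# Assumed sample markup: one link copied from the question's output.
html = ('<p><a href="https://www.instagram.com/breakfast_dreams/" '
        'target="_blank">Breakfast Dreams</a></p>')
soup = BeautifulSoup(html, "html.parser")

link = soup.select_one("a")                     # this is already the <a> element
nested = link.find("a", {"target": "_blank"})   # find() searches only inside link

print(nested)      # None, because the <a> contains no other <a>
print(link.text)   # the text is available on the element itself
```

Calling `.text` on `nested` is what raises the error from the question.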

So your extraction part should look more like this:

    results.append({
        'title': restaurant.text,
        'url': restaurant.get('href')
    })
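For reference, here is the corrected loop as a self-contained sketch that runs without a network request. The HTML is an assumption: it wraps the three links from the question's output in a hypothetical `sqs-block-content` container, which may not match the real page structure.

```python
from bs4 import BeautifulSoup

# Assumed sample markup built from the links shown in the question.
html = """
<div class="sqs-block-content">
  <a href="https://www.instagram.com/breakfast_dreams/" target="_blank">Breakfast Dreams</a>
  <a href="https://www.maritimecourier.com/breakfast-dreams" target="_blank">MARITIME</a>
  <a href="https://www.instagram.com/latarantellalb/" target="_blank">La Tarantella</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

results = []
for restaurant in soup.select(".sqs-block-content a"):
    # restaurant is the <a> element, so read its text and href directly.
    results.append({
        "title": restaurant.get_text(strip=True),
        "url": restaurant.get("href"),
    })

for row in results:
    print(row["title"], row["url"])
```

Using `get("href")` rather than `restaurant["href"]` returns None instead of raising KeyError when a link has no href attribute.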

You can also make your selection more specific:

[{'title':a.text, 'url':a.get('href')} for a in soup.select('.sqs-block-content a')]

Or drop all internal links:

[{'title':a.text, 'url':a.get('href')} for a in soup.select('.sqs-block-content a') if 'maritimecourier' not in a.get('href', '')]
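A substring test can misfire if the site name happens to appear in the path of an external URL. As a possible alternative, the sketch below compares hostnames with `urllib.parse.urlparse`. The helper name `external_links`, the `SITE_HOST` constant, and the sample HTML are all hypothetical and chosen for illustration only.

```python
from urllib.parse import urlparse

from bs4 import BeautifulSoup

SITE_HOST = "www.maritimecourier.com"  # assumed host of the scraped site


def external_links(soup, site_host=SITE_HOST):
    """Return title/url dicts for links that point to a different host."""
    links = []
    for a in soup.select("a[href]"):       # skip anchors without an href
        href = a["href"]
        host = urlparse(href).netloc       # empty string for relative URLs
        if host and host != site_host:
            links.append({"title": a.get_text(strip=True), "url": href})
    return links


# Assumed sample markup: two external links, one internal, one relative.
html = """
<div class="sqs-block-content">
  <a href="https://www.instagram.com/breakfast_dreams/">Breakfast Dreams</a>
  <a href="https://www.maritimecourier.com/breakfast-dreams">MARITIME</a>
  <a href="/restaurant">Restaurants</a>
  <a href="https://www.instagram.com/latarantellalb/">La Tarantella</a>
</div>
"""
found = external_links(BeautifulSoup(html, "html.parser"))
print([link["title"] for link in found])
```

Relative links such as `/restaurant` have no host, so this sketch treats them as internal and leaves them out.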
