Unable to scrape titles and authors from a website with multiple links



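I am trying to scrape this link. As an example, I only want to scrape the first page: I want to collect the title and author of each of the 10 links listed there.

To collect the titles and authors, I wrote the following lines of code: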

from bs4 import BeautifulSoup
import requests
import numpy as np
url = 'https://www.bis.org/cbspeeches/index.htm?m=1123'

r = BeautifulSoup(requests.get(url).content, features = "lxml")
r.select('#cbspeeches_list a') # '#cbspeeches_list a' got via SelectorGadget

However, I get an empty list. What am I doing wrong?

Thanks!

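The data is loaded from an external source through an API call that uses the POST method, so the list is not part of the HTML returned for the page URL. You just need to use the API's URL instead.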

from bs4 import BeautifulSoup
import requests

# Form fields sent by the page when it requests the list of speeches
payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url = 'https://www.bis.org/doclist/cbspeeches.htm'
headers = {
    "content-type": "application/x-www-form-urlencoded",
    "X-Requested-With": "XMLHttpRequest"
}
req = requests.post(url, headers=headers, data=payload)
print(req)  # shows the response status
soup = BeautifulSoup(req.content, "lxml")
data = []
# Each table row holds one speech
for card in soup.select('.documentList tbody tr'):
    title = card.select_one('.title a').get_text()
    author = card.select_one('.authorlnk.dashed').get_text().strip()
    data.append({
        'title': title,
        'author': author
    })
print(data)

[{'title': 'Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022', 'author': 'Pablo Hernández de Cos'},
 {'title': 'Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank ', 'author': 'Klaas Knot'},
 {'title': 'Luis de Guindos: Challenges for monetary policy', 'author': 'Luis de Guindos'},
 {'title': 'Fabio Panetta: Europe as a common shield -  protecting the euro area economy from global shocks', 'author': 'Fabio Panetta'},
 {'title': 'Victoria Cleland: Rowing in unison to enhance cross-border payments', 'author': 'Victoria Cleland'},
 {'title': 'Yaron Amir: A look at the future world of payments - trends, the market, and regulation', 'author': 'Yaron Amir'},
 {'title': 'Ásgeir Jónsson: Speech – 61st Annual Meeting of the Central Bank of Iceland', 'author': 'Ásgeir Jónsson'},
 {'title': 'Lesetja Kganyago: Project Khokha 2 report launch', 'author': 'Lesetja Kganyago'},
 {'title': 'Huw Pill: What did the monetarists ever do for us?', 'author': 'Huw Pill'},
 {'title': 'Shaktikanta Das: Inaugural address - Statistics Day Conference ', 'author': 'Shaktikanta Das'}]
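
The loop above raises an AttributeError if a row lacks a title or author link, because select_one returns None when nothing matches. Below is a minimal sketch of a more defensive version that can be tried offline. The parse_speeches function and the SAMPLE_HTML markup are hypothetical: the markup is shaped only after the selectors used above, and the real response from bis.org may look different.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, modelled only on the selectors used in the answer;
# the real response from bis.org may be structured differently.
SAMPLE_HTML = """
<table class="documentList">
  <tbody>
    <tr>
      <td>01 Jul 2022</td>
      <td>
        <div class="title"><a href="/review/r1.htm">Jane Doe: A sample speech</a></div>
        <a class="authorlnk dashed" href="/author/jane_doe.htm">
          Jane Doe
        </a>
      </td>
    </tr>
    <tr>
      <td>Row without title or author</td>
    </tr>
  </tbody>
</table>
"""


def parse_speeches(html):
    """Return a list of {'title', 'author'} dicts, skipping incomplete rows."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select(".documentList tbody tr"):
        title = row.select_one(".title a")
        author = row.select_one(".authorlnk.dashed")
        if title is None or author is None:
            continue  # select_one returns None when nothing matches
        rows.append({
            "title": title.get_text(strip=True),
            "author": author.get_text(strip=True),
        })
    return rows


print(parse_speeches(SAMPLE_HTML))
```

With a live response, the same function could be called as parse_speeches(req.content).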


Try this:

from bs4 import BeautifulSoup
import requests

# Form fields for the POST request; paging_length sets how many rows are returned
data = {
    'from': '',
    'till': '',
    'objid': 'cbspeeches',
    'page': '',
    'paging_length': '25',
    'sort_list': 'date_desc',
    'theme': 'cbspeeches',
    'ml': 'false',
    'mlurl': '',
    'emptylisttext': ''
}
response = requests.post('https://www.bis.org/doclist/cbspeeches.htm', data=data)
soup = BeautifulSoup(response.content, "lxml")
for elem in soup.find_all("tr"):
    # the title is the text of the first link in the row
    print(elem.find("a").text)
    # the author
    print(elem.find("a", class_="authorlnk dashed").text)
    print()

This prints:

Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022
Pablo Hernández de Cos
Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank 
Klaas Knot
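
The two answers send the same form fields in different shapes: the first as a pre-encoded string, the second as a dict. When data is a dict, requests form-encodes it and sets the Content-Type header itself, which is why the second answer needs no explicit content-type header. As a small sketch using only the standard library, the string can be converted to a dict, edited, and re-encoded; the variable names fields and encoded are chosen here for illustration.

```python
from urllib.parse import parse_qsl, urlencode

# The pre-encoded payload from the first answer
payload = ("from=&till=&objid=cbspeeches&page=&paging_length=10"
           "&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext=")

# keep_blank_values=True preserves empty fields such as 'from' and 'page',
# which parse_qsl would otherwise drop
fields = dict(parse_qsl(payload, keep_blank_values=True))
print(fields)

# Change one field and encode the dict back into a form string
fields["paging_length"] = "25"
encoded = urlencode(fields)
print(encoded)
```

After the change, fields holds the same keys and values as the data dict in the second answer, so either form can be passed to requests.post.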
