Web scraping with requests - selecting a filter on a website



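I use the following code to fetch the first 20 PDFs from the AMF website (https://bdif.amf-france.org). I am trying to be more specific and download only the "Déclaration des dirigeants" documents, but I don't know how to do that. How can I integrate this filter into the URL? Something like https://bdif.amf-france.org/back/api/v1/informations?from=0&size=2?typesInformation=DD. Can anyone help?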

import requests
from shutil import copyfileobj

endpoint = "https://bdif.amf-france.org/back/api/v1/informations?from=0&size=20"
base_api_url = "https://bdif.amf-france.org/back/api/v1/documents"
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:97.0) Gecko/20100101 Firefox/97.0",
}

with requests.Session() as s:
    response = s.get(endpoint, headers=headers).json()
    file_sources = [
        [
            f"{base_api_url}/{item['_source']['documents'][0]['path']}",  # Document
            item["_source"]["documents"][0]["nomFichier"]  # File name
        ]
        for item in response["hits"]["hits"]
    ]
    for file in file_sources:
        url, name = file
        with s.get(url, stream=True) as pdf, open(name, "wb") as output:
            copyfileobj(pdf.raw, output)

Set the typesInformation parameter to DD in the URL, like this:

endpoint = 'https://bdif.amf-france.org/back/api/v1/informations?from=0&size=20&typesInformation=DD'
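The URL attempted in the question joins the last parameter with a second "?" instead of "&", so typesInformation is never parsed as its own parameter. The sketch below shows this with the standard library and builds the corrected URL from a dictionary. The names API_URL, build_endpoint and broken_url are hypothetical helpers introduced for illustration, and the meaning of the DD code is taken from the question and answer rather than from any API documentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

API_URL = "https://bdif.amf-france.org/back/api/v1/informations"


def build_endpoint(types_information, start=0, size=20):
    """Return the search URL with paging and a typesInformation filter."""
    query = {"from": start, "size": size, "typesInformation": types_information}
    # urlencode joins every key=value pair with "&" and escapes the values.
    return f"{API_URL}?{urlencode(query)}"


# The URL from the question: the second "?" ends up inside the value of "size".
broken_url = f"{API_URL}?from=0&size=2?typesInformation=DD"
broken_params = parse_qs(urlsplit(broken_url).query)
print(broken_params)  # typesInformation is not a separate key here

# The corrected URL, identical to the endpoint given in the answer.
endpoint = build_endpoint("DD")
print(endpoint)
```

The resulting endpoint can be passed to s.get(endpoint, headers=headers) exactly as in the question's code. requests can also build the query string itself if the same dictionary is passed through the params argument of s.get.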
