Downloading files in Python by scraping sub-URLs



I am trying to download documents (mostly in PDF format) from a large number of web links, for example:

https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects

https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects

https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects

However, the PDF files are not directly accessible from these links; you have to click through to a sub-URL to reach each PDF. Is there a way to crawl those sub-URLs and download all the related files from them? I am trying the following code, but so far without any success, particularly for the URLs listed here.

Please let me know if you need any further clarification; I would be happy to provide it. Thank you.
from simplified_scrapy import Spider, SimplifiedDoc, SimplifiedMain, utils

class MySpider(Spider):
    name = 'download_pdf'
    allowed_domains = ["www.worldbank.org"]
    start_urls = [
        "https://projects.worldbank.org/en/projects-operations/document-detail/P167897?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P173997?type=projects",
        "https://projects.worldbank.org/en/projects-operations/document-detail/P166309?type=projects"
    ]  # Entry page

    def afterResponse(self, response, url, error=None, extra=None):
        if not extra:
            print("The version of library simplified_scrapy is too old, please update.")
            SimplifiedMain.setRunFlag(False)
            return
        try:
            path = './pdfs'
            # create folder start
            srcUrl = extra.get('srcUrl')
            if srcUrl:
                index = srcUrl.find('year/')
                year = ''
                if index > 0:
                    year = srcUrl[index + 5:]
                    index = year.find('?')
                    if index > 0:
                        path = path + year[:index]
                        utils.createDir(path)
            # create folder end
            path = path + url[url.rindex('/'):]
            index = path.find('?')
            if index > 0: path = path[:index]
            flag = utils.saveResponseAsFile(response, path, fileType="pdf")
            if flag:
                return None
            else:  # If it's not a pdf, leave it to the frame
                return Spider.afterResponse(self, response, url, error, extra)
        except Exception as err:
            print(err)

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lst = doc.selects('div.list >a').contains("documents/", attr="href")
        if not lst:
            lst = doc.selects('div.hidden-md hidden-lg >a')
        urls = []
        for a in lst:
            a["url"] = utils.absoluteUrl(url.url, a["href"])
            # Set root url start
            a["srcUrl"] = url.get('srcUrl')
            if not a['srcUrl']:
                a["srcUrl"] = url.url
            # Set root url end
            urls.append(a)
        return {"Urls": urls}

    # Download again by resetting the URL. Called when you want to download again.
    def resetUrl(self):
        Spider.clearUrl(self)
        Spider.resetUrlsTest(self)

SimplifiedMain.startThread(MySpider())  # Start download
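
Note: a likely reason the crawl above finds nothing is that these document-detail pages build their document list client-side with JavaScript, so the raw HTML a crawler receives contains none of the anchors that extract() selects. A minimal sketch to check this, reusing the same SimplifiedDoc selector chain on a plain HTTP fetch (if the printed count is 0, the links simply are not in the static page):

import requests
from simplified_scrapy import SimplifiedDoc

# Fetch one entry page's raw HTML, bypassing the spider framework.
url = ("https://projects.worldbank.org/en/projects-operations/"
       "document-detail/P167897?type=projects")
html = requests.get(url, timeout=30).text

# Apply the same selector chain the spider's extract() uses.
doc = SimplifiedDoc(html)
links = doc.selects('div.list >a').contains("documents/", attr="href")

# An empty result means the document list is rendered by JavaScript
# and never appears in the static HTML the crawler sees.
print(len(links or []))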

There is an API endpoint that contains the entire response you see on the website, along with... the URL of the document PDF. :D

So you can query the API, get the URLs, and finally fetch the documents.

Here's how:

import requests

pids = ["P167897", "P173997", "P166309"]

for pid in pids:
    # Query the search API for the project's public documents.
    end_point = (
        f"https://search.worldbank.org/api/v2/wds?"
        f"format=json&includepublicdocs=1&"
        f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&"
        f"os=0&rows=20&proid={pid}&apilang=en"
    )
    documents = requests.get(end_point).json()["documents"]
    for document_data in documents.values():
        try:
            pdf_url = document_data["pdfurl"]
            print(f"Fetching: {pdf_url}")
            # Save the PDF under its original file name.
            with open(pdf_url.rsplit("/")[-1], "wb") as pdf:
                pdf.write(requests.get(pdf_url).content)
        except KeyError:  # entry has no pdfurl; skip it
            continue

Output: (fully downloaded .pdf files)

Fetching: http://documents.worldbank.org/curated/en/106981614570591392/pdf/Official-Documents-Grant-Agreement-for-Additional-Financing-Grant-TF0B4694.pdf
Fetching: http://documents.worldbank.org/curated/en/331341614570579132/pdf/Official-Documents-First-Restatement-to-the-Disbursement-Letter-for-Grant-D6810-SL-and-for-Additional-Financing-Grant-TF0B4694.pdf
Fetching: http://documents.worldbank.org/curated/en/387211614570564353/pdf/Official-Documents-Amendment-to-the-Financing-Agreement-for-Grant-D6810-SL.pdf
Fetching: http://documents.worldbank.org/curated/en/799541612993594209/pdf/Sierra-Leone-AFRICA-WEST-P167897-Sierra-Leone-Free-Education-Project-Procurement-Plan.pdf
Fetching: http://documents.worldbank.org/curated/en/310641612199201329/pdf/Disclosable-Version-of-the-ISR-Sierra-Leone-Free-Education-Project-P167897-Sequence-No-02.pdf
and more ...
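
One caveat about the snippet above: a single call returns at most rows=20 documents. For projects with more documents than that, the os parameter in the query string looks like a result offset, so paging through it should collect everything. A hedged sketch under that assumption (the exact response shape beyond "documents" is not guaranteed; verify against a live response):

import requests

def iter_documents(pid, rows=20):
    # Step the `os` (offset) parameter until a page yields no usable entries.
    # Assumption: entries without a "pdfurl" (the reason for the KeyError
    # guard above) can be skipped, and an empty page means we are done.
    offset = 0
    while True:
        end_point = (
            f"https://search.worldbank.org/api/v2/wds?"
            f"format=json&includepublicdocs=1&"
            f"fl=docna,lang,docty,repnb,docdt,doc_authr,available_in&"
            f"os={offset}&rows={rows}&proid={pid}&apilang=en"
        )
        documents = requests.get(end_point).json().get("documents", {})
        page = [d for d in documents.values()
                if isinstance(d, dict) and "pdfurl" in d]
        if not page:
            break
        yield from page
        offset += rows

for document_data in iter_documents("P167897"):
    print(document_data["pdfurl"])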
