How to download files from a list of URLs using Selenium WebDriver



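I wrote some code that uses Selenium WebDriver to download files from a list of URLs, but for some reason it doesn't download anything to the directory I specified. The code works fine when I download the files one at a time, but it stops working when I use a for loop.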

Here is a sample URL: https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf

Here is my code:
download_dir = '/Users/datawizard/files/'
for web in down_link:
    try:
        options = webdriver.ChromeOptions()
        options.add_argument('headless')
        options.add_experimental_option("prefs", {
            "download.default_directory": '/Users/clinton/GRA_2021/scraping_project/pdf/',
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            # "safebrowsing.enabled": True,
            "plugins.always_open_pdf_externally": True
        })
        driver = webdriver.Chrome(chrome_options=options)
        driver.command_executor._commands["send_command"] = ("POST", '/session/$sessionId/chromium/send_command')
        params = {'cmd': 'Page.setDownloadBehavior', 'params': {'behavior': 'allow', 'downloadPath': download_dir}}
        command_result = driver.execute("send_command", params)

        driver.get(url)

    except:
        print(str(web)+"Link cannot be open")

I'd like to know what is wrong with my code, since it doesn't raise any errors when I run it.

You don't need Selenium to download the files; the requests library handles this easily:

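import requests

YOUR_DOWNLOAD_PATH = "/Users/datawizard/files/"  # replace with your own directory

for web in down_link:  # down_link is your list of URLs
    # Build a filename from the documentId query parameter
    fileName = YOUR_DOWNLOAD_PATH + web.split("=")[1].split("&")[0] + ".pdf"

    r = requests.get(web, stream=True)
    r.raise_for_status()  # fail on HTTP errors instead of saving an error page as a PDF
    with open(fileName, 'wb') as f:
        # iter_content() defaults to 1-byte chunks, so pass a larger chunk size
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)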
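
The filename above is built by splitting the URL on "=" and "&", which depends on documentId being the first query parameter. As a rough sketch of a sturdier alternative, the standard library's urllib.parse can read the parameters by name. The helper name filename_from_url and the "download" fallback are made up for illustration; only the documentId and attachmentNumber parameter names come from the sample URL.

```python
from urllib.parse import urlparse, parse_qs


def filename_from_url(url, default="download"):
    """Build a PDF filename from the documentId and attachmentNumber query parameters.

    Hypothetical helper: falls back to `default` when documentId is missing.
    """
    query = parse_qs(urlparse(url).query)
    doc_id = query.get("documentId", [default])[0]
    attachment = query.get("attachmentNumber", [None])[0]
    # Include the attachment number so several attachments of one document do not collide
    stem = doc_id if attachment is None else f"{doc_id}_{attachment}"
    return stem + ".pdf"


sample = ("https://www.regulations.gov/contentStreamer"
          "?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf")
print(filename_from_url(sample))  # WHD-2020-0007-1730_1.pdf
```

Reading parameters by name keeps working if the order of the query parameters changes.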

Updated answer using Selenium

import time
from selenium import webdriver

# Replace the value below with your list of URLs
down_link = [
    'https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf',
    'https://www.regulations.gov/contentStreamer?documentId=WHD-2020-0007-1730&attachmentNumber=1&contentType=pdf']
download_dir = "/Users/datawizard/files/"

# Configure and start Chrome once, outside the loop
options = webdriver.ChromeOptions()
options.add_argument('headless')
options.add_experimental_option("prefs", {
    "download.default_directory": download_dir,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    "plugins.always_open_pdf_externally": True
})
driver = webdriver.Chrome(options=options)  # 'chrome_options' is deprecated in favor of 'options'

for web in down_link:
    driver.get(web)
    time.sleep(5)  # wait for the download to finish; checking that the file exists would be more robust
driver.quit()
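
The comment on the sleep call suggests checking for the file instead of waiting a fixed five seconds. A minimal sketch of that idea follows, assuming Chrome's usual behavior of writing in-progress downloads with a ".crdownload" suffix. The function name wait_for_downloads and its parameters are hypothetical, not part of Selenium.

```python
import os
import time


def wait_for_downloads(directory, expected_count, timeout=30.0, poll_interval=0.5):
    """Poll `directory` until it holds at least `expected_count` finished files
    and no Chrome partial downloads (*.crdownload). Returns False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(directory)
        in_progress = [n for n in names if n.endswith(".crdownload")]
        finished = [n for n in names if not n.endswith(".crdownload")]
        if not in_progress and len(finished) >= expected_count:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)


# Possible use inside the loop, in place of time.sleep(5):
#     before = len(os.listdir(download_dir))
#     driver.get(web)
#     if not wait_for_downloads(download_dir, before + 1):
#         print("Timed out waiting for", web)
```

Counting files only works if each download produces a new file, so this sketch assumes the download directory is not written to by anything else during the run.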

If your files don't have unique filenames, the code above will replace existing files with the newly downloaded ones.
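
Where you choose the filename yourself (as in the requests approach), one way to avoid overwriting is to pick a free name before writing. This is only a sketch: unique_path is a hypothetical helper and the "_1", "_2" suffix scheme is an arbitrary choice.

```python
import os


def unique_path(path):
    """Return `path` if it is free, otherwise append _1, _2, ... before the extension."""
    if not os.path.exists(path):
        return path
    stem, ext = os.path.splitext(path)
    counter = 1
    while os.path.exists(f"{stem}_{counter}{ext}"):
        counter += 1
    return f"{stem}_{counter}{ext}"


# Possible use before opening the output file:
#     with open(unique_path(fileName), 'wb') as f:
#         ...
```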
