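I want to download a PDF from an online magazine. To open it I have to log in first; then I open the PDF and download it.
Here is my code. It logs in to the page and the PDF opens, but I can't download the PDF because I'm not sure how to simulate clicking "Save". I'm using Firefox.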
import os, time
from selenium import webdriver
from bs4 import BeautifulSoup
# Use the Firefox download manager to save the file
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList",2)
fp.set_preference("browser.download.manager.showWhenStarting",False)
fp.set_preference("browser.download.dir", 'D:/eBooks/Stocks_andCommodities/2008/Jul/')
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf")
fp.set_preference("pdfjs.disabled", True)
# disable Adobe Acrobat PDF preview plugin
fp.set_preference("plugin.scan.plid.all", False)
fp.set_preference("plugin.scan.Acrobat", "99.0")
browser = webdriver.Firefox(firefox_profile=fp)
# Get the login web page
web_url = 'http://technical.traders.com/sub/sublogin2.asp'
browser.get(web_url)
# Simulate the authentication
user_name = browser.find_element_by_css_selector('#SubID > input[type="text"]')
user_name.send_keys("thomas2003@test.net")
password = browser.find_element_by_css_selector('#SubName > input[type="text"]')
password.send_keys("LastName")
time.sleep(2)
submit = browser.find_element_by_css_selector('#SubButton > input[type="submit"]')
submit.click()
time.sleep(2)
# Open the PDF for downloading
# Raw string: in a normal literal Python reads \131 as an octal escape (the letter Y)
url = r'http://technical.traders.com/archive/articlefinal.asp?file=V26C07\131INTR.pdf'
browser.get(url)
time.sleep(10)
# How to simulate the Clicking to Save/Download the PDF here?
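You should not open the file in the browser. Once you have the file URL, get a requests session that carries all of the browser's cookies:
import requests

def get_request_session(driver):
    # Copy every cookie from the logged-in browser into a requests session
    session = requests.Session()
    for cookie in driver.get_cookies():
        session.cookies.set(cookie['name'], cookie['value'])
    return session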
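Once you have the session, you can use it to download the file:
# Raw string so the backslash in the URL is kept literally
url = r'http://technical.traders.com/archive/articlefinal.asp?file=V26C07\131INTR.pdf'
session = get_request_session(browser)  # browser is the logged-in WebDriver from the question
r = session.get(url, stream=True)
chunk_size = 2000
# Stream the response to disk chunk by chunk
with open('/tmp/mypdf.pdf', 'wb') as file:
    for chunk in r.iter_content(chunk_size):
        file.write(chunk)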
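A possible refinement, sketched under assumptions: if the login cookies did not carry over, the server may answer with an HTML login page instead of the PDF, and the loop above would happily save that page under a .pdf name. The helper below, `save_pdf_chunks` (an illustrative name, not part of requests), takes any iterable of byte chunks, such as `r.iter_content(chunk_size)`, and refuses to write data that does not start with the `%PDF-` signature. It assumes the server sends the PDF header at the very start of the body.

```python
PDF_MAGIC = b"%PDF-"  # signature found at the start of PDF files


def save_pdf_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`, rejecting non-PDF data.

    Returns the number of bytes written. Raises ValueError, without
    creating the file, if the data does not start with the PDF signature.
    """
    iterator = iter(chunks)
    head = b""
    # Read just enough chunks to compare against the signature.
    for chunk in iterator:
        head += chunk
        if len(head) >= len(PDF_MAGIC):
            break
    if not head.startswith(PDF_MAGIC):
        raise ValueError("response does not look like a PDF: %r" % head[:40])

    written = 0
    with open(path, "wb") as out:
        out.write(head)
        written += len(head)
        # Continue with whatever the iterator has left.
        for chunk in iterator:
            out.write(chunk)
            written += len(chunk)
    return written
```

With the session from above it could be called as `save_pdf_chunks(r.iter_content(chunk_size), '/tmp/mypdf.pdf')`; a ValueError then usually means the request was not authenticated.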
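Besides Tarun's solution, you can also download the file with JavaScript and store it as a blob. You can then pull the data into Python through Selenium's execute_script, as shown in this answer.
In your case: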
import json

# Raw string so the backslash in the URL is kept literally
url = r'http://technical.traders.com/archive/articlefinal.asp?file=V26C07\131INTR.pdf'
# Fetch the PDF from inside the page, so the login cookies are sent, and
# keep the result on window.file_contents as a base64 data URL.
browser.execute_script("""
    window.file_contents = null;
    var xhr = new XMLHttpRequest();
    xhr.responseType = 'blob';
    xhr.onload = function() {
        var reader = new FileReader();
        reader.onloadend = function() {
            window.file_contents = reader.result;
        };
        reader.readAsDataURL(xhr.response);
    };
    xhr.open('GET', %(download_url)s);
    xhr.send();
""".replace('\r\n', ' ').replace('\r', ' ').replace('\n', ' ') % {
    'download_url': json.dumps(url),
})
Your data now sits on the window object as a base64 data URL, so you can easily extract it into Python:
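import base64
import time

time.sleep(3)  # crude wait for the XHR and the FileReader to finish
downloaded_file = browser.execute_script("return (window.file_contents !== null ? window.file_contents.split(',')[1] : null);")
if downloaded_file is None:
    raise RuntimeError('window.file_contents is still null; wait longer')
with open('/Users/Chetan/Desktop/dummy.pdf', 'wb') as f:
    f.write(base64.b64decode(downloaded_file))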
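The fixed `time.sleep(3)` is a guess: a large PDF or a slow connection may need longer. A minimal sketch of polling instead, assuming the page-side script stores its result on `window.file_contents` as above; `fetch_file_contents` and its `timeout` and `poll` parameters are hypothetical names, and the only driver method it relies on is `execute_script`.

```python
import base64
import time


def fetch_file_contents(driver, timeout=30.0, poll=0.5):
    """Poll window.file_contents until it is set, then return the raw bytes."""
    deadline = time.monotonic() + timeout
    while True:
        data_url = driver.execute_script("return window.file_contents;")
        if data_url is not None:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "window.file_contents was not set within %.1f s" % timeout)
        time.sleep(poll)

    # A data URL looks like "data:<mime>;base64,<payload>"; keep the payload.
    header, _, payload = data_url.partition(",")
    if not header.endswith(";base64"):
        raise ValueError("unexpected data URL header: %r" % header)
    return base64.b64decode(payload)
```

It could be used as `pdf_bytes = fetch_file_contents(browser)` followed by writing `pdf_bytes` to a file opened in `'wb'` mode.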
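Try:
import urllib.request  # on Python 2 the function is urllib.urlretrieve

pdf_link = "<PDF LINK>"  # placeholder: the direct URL of the PDF
file_path = "<FILE PATH TO SAVE>"
urllib.request.urlretrieve(pdf_link, file_path)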
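One caveat: a bare urlretrieve call knows nothing about the Selenium login, so for a page behind a login it will probably fetch the login form rather than the PDF. A minimal standard-library sketch that forwards the browser's cookies, assuming the cookies returned by `browser.get_cookies()` are all the site needs for authentication; `cookie_header` and `download_with_cookies` are hypothetical helper names.

```python
import shutil
import urllib.request


def cookie_header(cookies):
    """Build a Cookie header value from Selenium-style cookie dicts."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


def download_with_cookies(url, cookies, file_path):
    """Fetch `url` with the given cookies and stream the body to `file_path`."""
    request = urllib.request.Request(
        url, headers={"Cookie": cookie_header(cookies)})
    with urllib.request.urlopen(request) as response, \
            open(file_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

After logging in with Selenium this could be called as `download_with_cookies(pdf_link, browser.get_cookies(), file_path)`.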