Downloading a file to the AWS Lambda /tmp directory using Chromedriver



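I am trying to use Chromedriver to automatically download a file from https://registry.verra.org/app/search/VCS/All%20Projects into the /tmp directory of AWS Lambda. The steps are: 1) click the "Search" button, 2) wait for the results to load, 3) click the "Excel" icon to download the file.

I have referred to and tried the code from the following two questions to change Chromedriver's download directory, but the file is not downloaded to the /tmp path.

AWS Lambda: downloading a file using Chromedriver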

prefs = {
"profile.default_content_settings.popups": 0,
"download.default_directory": r"/tmp",
"directory_upgrade": True
}
options.add_experimental_option("prefs", prefs)

Unable to change Chrome's default download location to /tmp in AWS Lambda using Selenium

options = webdriver.ChromeOptions()
prefs = {"browser.downloads.dir": "//tmp//", "download.default_directory": "//tmp//", "directory_upgrade": True}
options.add_experimental_option("prefs", prefs)

For reference, here is my code. It works fine locally but not in AWS Lambda.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import os
import requests
import requests.auth
import json
import csv


def lambda_handler(event, context):

    # change directory to /tmp folder
    os.chdir('/tmp')

    # get dataset from website
    options = Options()
    options.binary_location = '/opt/headless-chromium'
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument('--single-process')
    options.add_argument('--disable-dev-shm-usage')

    ## SAVE TO TMP DIRECTORY
    # set download settings
    prefs = {
        "profile.default_content_settings.popups": 0,
        "download.default_directory": r"/tmp",
        "directory_upgrade": True
    }
    options.add_experimental_option("prefs", prefs)

    ## open Chrome webdriver
    driver = webdriver.Chrome('/opt/chromedriver', options=options)
    driver.maximize_window()
    driver.get('https://registry.verra.org/app/search/VCS/All%20Projects')

    # XPaths of the search button, the first pager link, and the Excel download button
    search_xpath = '/html/body/apx-root/div/div/apx-search-page/div/apx-search-container/div/div[2]/div/div[1]/apx-search-selection-criteria/div/form/div[2]/div/button[1]'
    pager_xpath = '/html/body/apx-root/div/div/apx-search-page/div/apx-search-container/div/div[2]/div/div[2]/apx-project-search-results/div/div/kendo-grid/kendo-pager/kendo-pager-numeric-buttons/ul/li[1]/a'
    download_xpath = '/html/body/apx-root/div/div/apx-search-page/div/apx-search-container/div/div[2]/div/div[2]/apx-project-search-results/div/apx-search-results-header/div/button[1]'

    # wait up to 60 seconds for website content to load
    print("Waiting for website to load...")
    search_btn = WebDriverWait(driver, 60).until(EC.presence_of_element_located((By.XPATH, search_xpath)))
    print("Website loaded!")

    # click on search button to load results
    search_btn.click()

    # wait up to 100 seconds for results to load - determined by checking the page numbers
    WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, pager_xpath)))
    print("Results loaded!")

    # wait up to 100 seconds for the download button to be present
    download_btn = WebDriverWait(driver, 100).until(EC.presence_of_element_located((By.XPATH, download_xpath)))

    # click on download button via JavaScript, in case the element is not clickable
    # (execute_script returns None here, so there is no file path to capture)
    driver.execute_script("arguments[0].click();", download_btn)

    # wait for 60 seconds for file to download
    time.sleep(60)

    # check if file is downloaded to /tmp directory
    files = os.listdir('/tmp')
    print("list", files)

    response = {
        "statusCode": 200,
        "body": "Selenium Headless Chrome Initialized"
    }

    return response
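
As a side note on the fixed 60-second sleep in the code above: polling the download directory is one possible alternative. The following is only a minimal sketch, not part of the original code. `wait_for_download` is a hypothetical helper, and it assumes Chrome names in-progress downloads with a `.crdownload` suffix and that files already present in the directory are passed in through `ignore`.

```python
import os
import time


def wait_for_download(directory, timeout=60.0, poll_interval=0.5, ignore=()):
    """Poll `directory` until a finished download appears (hypothetical helper).

    Returns the sorted names of new files once at least one file exists and
    no partial `.crdownload` file is left. Raises TimeoutError otherwise.
    """
    deadline = time.monotonic() + timeout
    ignored = set(ignore)
    while True:
        # Only consider files that were not present before the download started.
        names = [n for n in os.listdir(directory) if n not in ignored]
        in_progress = [n for n in names if n.endswith(".crdownload")]
        finished = [n for n in names if not n.endswith(".crdownload")]
        if finished and not in_progress:
            return sorted(finished)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no finished download in {directory!r}")
        time.sleep(poll_interval)
```

In the handler above, one could take `before = set(os.listdir('/tmp'))` just before the click and call `wait_for_download('/tmp', ignore=before)` in place of `time.sleep(60)`. This only detects a file once the download itself works; it does not make Chrome start the download.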

Could this be a version issue? My code only works when the runtime is set to Python 3.7.

Selenium version: selenium/python/lib/python3.7/site-packages selenium==3.8.0 (runtime python3.7)

Chromedriver version: https://chromedriver.storage.googleapis.com/2.37/chromedriver_linux64.zip (only works with the Python 3.9 runtime; switching to Python 3.7 does not work)

Headless Chrome: https://github.com/adieuadieu/serverless-chrome/releases/download/v1.0.0-41/stable-headless-chromium-amazonlinux-2017-03.zip

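I had the same problem and found the answer in this post. Apparently, chromedriver handles downloads differently when they are initiated by a POST request generated by a button click.

Adding the following code fixed the problem for me: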

driver.command_executor._commands["send_command"] = (
"POST",
"/session/$sessionId/chromium/send_command",
)
params = {
"cmd": "Page.setDownloadBehavior",
"params": {"behavior": "allow", "downloadPath": '/tmp'},
}
data = driver.execute("send_command", params)
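
For reuse, the snippet above could be wrapped in a small function. This is only a sketch: `enable_headless_downloads` is a hypothetical name, and it assumes the call is made after the driver is created and before the download button is clicked. It relies on the same private `_commands` mapping as the snippet above, which may differ between Selenium versions.

```python
def enable_headless_downloads(driver, download_dir="/tmp"):
    """Ask headless Chrome to allow downloads into `download_dir`.

    Hypothetical wrapper around the snippet above; `driver` is expected to be
    a Selenium Chrome WebDriver instance.
    """
    # Register the Chromium-specific endpoint with Selenium's command executor.
    driver.command_executor._commands["send_command"] = (
        "POST",
        "/session/$sessionId/chromium/send_command",
    )
    # DevTools command that permits downloads and sets the target directory.
    params = {
        "cmd": "Page.setDownloadBehavior",
        "params": {"behavior": "allow", "downloadPath": download_dir},
    }
    return driver.execute("send_command", params)
```

In the question's handler, this would presumably be called as `enable_headless_downloads(driver, '/tmp')` right after `webdriver.Chrome(...)`.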

Hope this helps!
