Selenium Python - Get a list of all loaded URLs (images, scripts, stylesheets etc)

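When Google Chrome loads a web page through Selenium, it may also load other files the page needs, for example those referenced by <img src="example.com/a.png"> or <script src="example.com/a.js"> tags, as well as CSS files.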

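How can I get a list of all URLs that the browser downloads while loading a page, programmatically, in Python with Selenium and chromedriver? In other words, I want the list of files shown in the "Network" tab of Chrome's developer tools.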

Sample code using Selenium and chromedriver:

from selenium import webdriver
options = webdriver.ChromeOptions()
options.binary_location = "/usr/bin/x-www-browser"
driver = webdriver.Chrome("./chromedriver", chrome_options=options)
# Load some page
driver.get("https://example.com")
# Now, how do I see a list of downloaded URLs that took place when loading the page above?

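You might want to look at BrowserMob Proxy. It can capture performance data for web applications (in HAR format), and it can also manipulate browser behavior and traffic, for example by whitelisting and blacklisting content, simulating network traffic and latency, and rewriting HTTP requests and responses.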

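The example below is taken from readthedocs. Usage is simple, and it integrates well with the Selenium webdriver API. More information about BMP is available in its documentation.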

from browsermobproxy import Server
server = Server("path/to/browsermob-proxy")
server.start()
proxy = server.create_proxy()
from selenium import webdriver
profile  = webdriver.FirefoxProfile()
profile.set_proxy(proxy.selenium_proxy())
driver = webdriver.Firefox(firefox_profile=profile)

proxy.new_har("google")
driver.get("http://www.google.co.uk")
proxy.har # returns a HAR JSON blob
server.stop()
driver.quit()
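
The HAR blob returned by proxy.har is a nested dictionary: the requests live under log -> entries, and each entry carries a request (with its url) and a response (whose content.mimeType field, per the HAR 1.2 format, names the resource type). As a minimal sketch of how that structure could be used to separate images, scripts, and stylesheets, the helper below groups URLs by MIME type. The function name group_urls_by_mime_type and the sample_har data are hypothetical and hand-written for illustration; in practice you would pass proxy.har instead.

```python
from collections import defaultdict


def group_urls_by_mime_type(har):
    """Group request URLs from a HAR dict by the response MIME type."""
    grouped = defaultdict(list)
    for entry in har.get("log", {}).get("entries", []):
        url = entry.get("request", {}).get("url")
        if url is None:
            continue  # Skip entries that carry no request URL
        content = entry.get("response", {}).get("content", {})
        mime = content.get("mimeType") or "unknown"
        # Drop parameters such as "; charset=UTF-8"
        mime = mime.split(";")[0].strip() or "unknown"
        grouped[mime].append(url)
    return dict(grouped)


# Hypothetical HAR fragment (not real captured data)
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"},
             "response": {"content": {"mimeType": "text/html; charset=UTF-8"}}},
            {"request": {"url": "https://example.com/a.png"},
             "response": {"content": {"mimeType": "image/png"}}},
            {"request": {"url": "https://example.com/a.js"},
             "response": {"content": {"mimeType": "application/javascript"}}},
            {"request": {"url": "https://example.com/beacon"},
             "response": {"content": {}}},
        ]
    }
}

grouped = group_urls_by_mime_type(sample_har)
for mime, urls in grouped.items():
    print(mime, urls)
```

Entries without a reported MIME type end up under "unknown" rather than being dropped, so the total number of URLs stays the same as in the HAR.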

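Following @GPT14's suggestion in his answer, I wrote a small script that does exactly what I wanted and prints the list of URLs loaded by a given page.

It uses BrowserMob Proxy. Many thanks to @GPT14 for suggesting it; it suits this purpose very well. I took the code from his answer and adapted it to the Google Chrome webdriver instead of Firefox. I also extended the script so that it walks the HAR JSON output and lists every request URL. Remember to adjust the options below to your setup.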

from browsermobproxy import Server
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Purpose of this script: list all resources (URLs) that
# Chrome downloads when visiting some page.

### OPTIONS ###
url = "https://example.com"
chromedriver_location = "./chromedriver"  # Path to the chromedriver executable
browsermobproxy_location = "/opt/browsermob-proxy-2.1.4/bin/browsermob-proxy"  # Location of the browsermob-proxy binary file (that starts a server)
chrome_location = "/usr/bin/x-www-browser"
###############

# Start browsermob proxy
server = Server(browsermobproxy_location)
server.start()
proxy = server.create_proxy()

# Set up the Chrome webdriver - note: does not seem to work with headless on
options = webdriver.ChromeOptions()
options.binary_location = chrome_location
# Point Chrome at our browsermob proxy so that it can track requests
options.add_argument('--proxy-server=%s' % proxy.proxy)
# Selenium 4 call style; older releases used
# webdriver.Chrome(chromedriver_location, chrome_options=options)
driver = webdriver.Chrome(service=Service(chromedriver_location), options=options)

# Now load some page
proxy.new_har("Example")
driver.get(url)

# Print all URLs that were requested
entries = proxy.har['log']["entries"]
for entry in entries:
    if 'request' in entry:
        print(entry['request']['url'])

server.stop()
driver.quit()
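
A page often requests the same URL more than once, so the printed list can contain repeats. As a rough sketch of one way to tidy it up, the helpers below remove duplicates while keeping the request order and then list the distinct hosts that were contacted. The names unique_urls and contacted_hosts, and the sample_entries list, are hypothetical; with the script above you would pass proxy.har['log']['entries'] instead.

```python
from urllib.parse import urlsplit


def unique_urls(entries):
    """Return request URLs from HAR entries, de-duplicated, in first-seen order."""
    seen = set()
    result = []
    for entry in entries:
        url = entry.get("request", {}).get("url")
        if url and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def contacted_hosts(urls):
    """Return the sorted set of hostnames appearing in the given URLs."""
    hosts = set()
    for url in urls:
        hostname = urlsplit(url).hostname
        if hostname:
            hosts.add(hostname)
    return sorted(hosts)


# Hypothetical HAR entries (not real captured data)
sample_entries = [
    {"request": {"url": "https://example.com/"}},
    {"request": {"url": "https://cdn.example.net/a.js"}},
    {"request": {"url": "https://example.com/a.png"}},
    {"request": {"url": "https://cdn.example.net/a.js"}},  # repeated request
    {"response": {}},  # entry without a request is skipped
]

urls = unique_urls(sample_entries)
for url in urls:
    print(url)
print(contacted_hosts(urls))
```

Using a set for membership checks alongside a list for output keeps the original request order, which a plain set(urls) would not.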
