Capturing and storing request data with Playwright for Python



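I'm using Playwright in Python (after giving up on the proxymob approach), and I'm trying to capture all the headers of a given request/response with the following code: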

import asyncio
from time import sleep
from urllib import request
from playwright.async_api import async_playwright
from playwright.sync_api import sync_playwright

async def main():
    async with async_playwright() as p:
        for browser_type in [p.chromium, p.firefox, p.webkit]:
            browser = await browser_type.launch(headless=False)
            page = await browser.new_page()
            page.on("request", lambda request: print(str(request.method), str(request.url), str(request.all_headers())))
            page.on("response", lambda response: print(str(response.all_headers())))

            # I navigate to some page
            await asyncio.sleep(15)

asyncio.run(main())

The output I get is:

<bound method Request.all_headers of <Request url='...' method='GET'>
<bound method Response.all_headers of <Response url='...'>

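As you can see, the output I get is useless. If I use the "sync" API instead, I can see the actual headers in the output. However, I'm using the async API because I want to capture data while browsing, without having to hard-code the navigation (at which point it would make more sense to just use devtools).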

Here are my questions:

  1. How do I expose the headers in the output using the async approach above?
  2. How do I store the above output in a dictionary?

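The lambda slot expects a function. I tried writing a custom function that adds the output to a dictionary, but nothing ended up being stored (with either async or sync).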

I'm not used to working with async, and I'm not sure I fully understand your question, but I think this is what you want:

import asyncio
from playwright.async_api import async_playwright

async def main():
    async with async_playwright() as p:
        for browser_type in [p.chromium, p.firefox, p.webkit]:
            browser = await browser_type.launch(headless=False)
            page = await browser.new_page()

            # Wrap the action that triggers the request in expect_request().
            # Here the trigger is navigating to Google; in your case it could
            # be clicking a button or anything else.
            async with page.expect_request("https://www.google.com/images/branding/googlelogo/1x/googlelogo_color_272x92dp.png") as first:
                await page.goto("https://google.com")
            first_request = await first.value
            print(f"The method of the request was: {first_request.method}")
            print(f"The url of the request was: {first_request.url}")
            print()
            print("And the headers:")
            # In the async API all_headers() must be awaited; the result is a
            # dict of header names to values, so you can use it like JSON.
            print(await first_request.all_headers())
            await asyncio.sleep(5)
            # Close this browser before launching the next one.
            await browser.close()

asyncio.run(main())

I did this with Google; you should do it with your own page, where you know what the request URL should be.

Here are some docs: https://playwright.dev/python/docs/api/class-page#page-wait-for-request
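
For the original goal (recording headers while browsing freely, and keeping them in a dictionary), an event-handler variant might look like the sketch below. It rests on a few assumptions: that `page.on` accepts coroutine functions as handlers in the async API, that keying the store by URL is good enough (a repeated URL overwrites its earlier entry), and that the names `captured`, `record_request` and `record_response` are made up for illustration. The key point is that `all_headers()` is a coroutine in the async API, so it has to be awaited inside an `async def` handler; a plain lambda cannot do that.

```python
import asyncio

# Hypothetical store: maps each URL to the data captured for it.
captured = {}


async def record_request(request):
    # all_headers() must be awaited in the async API; it yields a dict
    # of header names to values.
    headers = await request.all_headers()
    entry = captured.setdefault(request.url, {})
    entry["method"] = request.method
    entry["request_headers"] = headers


async def record_response(response):
    headers = await response.all_headers()
    captured.setdefault(response.url, {})["response_headers"] = headers


async def main():
    # Imported here so the two helpers above can be exercised without a browser.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=False)
        page = await browser.new_page()
        # Assumption: coroutine functions are accepted as event handlers.
        page.on("request", record_request)
        page.on("response", record_response)
        # Browse manually in the opened window; traffic is recorded meanwhile.
        await asyncio.sleep(15)
        await browser.close()

    for url, entry in captured.items():
        print(url, entry)
```

To try it, call `asyncio.run(main())` and browse in the window that opens. Handlers that are still running when the browser closes may fail, so for anything beyond a quick experiment it is probably worth adding error handling around the `all_headers()` calls.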
