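I'm using crawlee@3.0.3 (not yet released on GitHub) and trying to block specific resources from loading with playwrightUtils.blockRequests (which was not available in previous versions). When I try the code suggested in the official repo, it works as expected: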
import { launchPlaywright, playwrightUtils } from 'crawlee';
const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
    // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see from the screenshot that the images are not loaded. My problem comes up when I use PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await playwrightUtils.blockRequests(page);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});
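This way, I am unable to block specific resources. My guess is that blockRequests needs launchPlaywright in order to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has already been available for Puppeteer, so maybe someone has tried this before.
I have also tried "route interception", but again I couldn't get it to work with PlaywrightCrawler.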
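The requestHandler runs only after the page has already been navigated, so calling blockRequests there is too late to affect that page load. You can use preNavigationHooks to set up any listeners or run any code before navigation, like this: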
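import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    // Hooks run before page.goto(), so blocking is in place when loading starts.
    preNavigationHooks: [async ({ page }) => {
        await playwrightUtils.blockRequests(page);
    }],
    // By the time requestHandler runs, the page has already been loaded.
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

// Start the crawl with the URL from the question.
await crawler.run(['https://cnn.com']);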
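
Since the question also mentions route interception, the same hook can probably carry a plain Playwright page.route call instead of blockRequests. The following is only a minimal sketch under that assumption: shouldBlock and makeBlockingHook are hypothetical helper names (not part of crawlee or Playwright), and the default of blocking the 'image' resource type is an illustrative choice. Only page.route, route.request, route.abort, route.continue, request.url and request.resourceType are actual Playwright calls.

```javascript
// Hypothetical helper (not a crawlee API): decide whether a request should be blocked,
// either by Playwright resource type or by a plain substring match on the URL.
function shouldBlock(url, resourceType, { resourceTypes = ['image'], urlPatterns = [] } = {}) {
    if (resourceTypes.includes(resourceType)) return true;
    return urlPatterns.some((pattern) => url.includes(pattern));
}

// Hypothetical factory: returns a function that can be used as a preNavigationHooks entry.
function makeBlockingHook(options) {
    return async ({ page }) => {
        // Register the interceptor before the crawler navigates the page.
        await page.route('**/*', (route) => {
            const request = route.request();
            return shouldBlock(request.url(), request.resourceType(), options)
                ? route.abort()      // drop the request
                : route.continue();  // let everything else through
        });
    };
}
```

It would be wired in as preNavigationHooks: [makeBlockingHook({ urlPatterns: ['adsbygoogle.js'] })]. Because it goes through Playwright's routing layer rather than blockRequests, it is a different mechanism with the same intent, and it lets you filter on resource type as well as URL.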