Blocking specific resources (CSS, images, videos, etc.) with crawlee and Playwright



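I am using crawlee@3.0.3 (not yet released, installed from GitHub) and trying to block specific resources from loading with playwrightUtils.blockRequests (which was not available in previous versions). When I try the code suggested in the official repo, it works as expected: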

import { launchPlaywright, playwrightUtils } from 'crawlee';
const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
    // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();

I can see from the screenshot that the images are not loaded. My problem comes up when I use PlaywrightCrawler:

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await playwrightUtils.blockRequests(page);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

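This way I cannot block the resources. My guess is that blockRequests needs launchPlaywright in order to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has already been available for Puppeteer, so maybe someone has tried this before.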

I also tried "route interception", but I still could not get it to work with PlaywrightCrawler.

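requestHandler is called only after the crawler has navigated to the page, so calling blockRequests there is too late: the resources have already been loaded. You can use preNavigationHooks to set up any listeners or code before navigation, like this: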


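import { PlaywrightCrawler, playwrightUtils } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    // Hooks run before page.goto(), so blocking is in place when loading starts.
    preNavigationHooks: [async ({ page }) => {
        await playwrightUtils.blockRequests(page);
    }],
    // By the time requestHandler runs, the page has already been loaded.
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});

// Start URL taken from the question's first snippet.
await crawler.run(['https://cnn.com']);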
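
The question also mentions route interception. The same hook mechanism should work for it, since the route has to be registered before navigation as well. Below is a minimal sketch, not a definitive implementation: the names BLOCKED_RESOURCE_TYPES, shouldBlock, handleRoute and blockResourcesHook are hypothetical, and the list of blocked resource types is an assumption to adjust. It relies only on Playwright's page.route, route.request().resourceType(), route.abort() and route.continue(). The crawlee wiring is left commented out so the sketch has no dependencies.

```javascript
// Resource types to drop. This list is an illustrative assumption.
const BLOCKED_RESOURCE_TYPES = new Set(['image', 'media', 'font', 'stylesheet']);

// Pure decision helper (hypothetical name), kept separate so it is easy to test.
function shouldBlock(resourceType) {
    return BLOCKED_RESOURCE_TYPES.has(resourceType);
}

// Route handler: abort blocked resource types, let everything else through.
function handleRoute(route) {
    const type = route.request().resourceType();
    return shouldBlock(type) ? route.abort() : route.continue();
}

// Pre-navigation hook (hypothetical name): registers the handler for every URL
// before the crawler calls page.goto().
async function blockResourcesHook({ page }) {
    await page.route('**/*', handleRoute);
}

// Possible wiring with crawlee, mirroring the snippet above:
// import { PlaywrightCrawler } from 'crawlee';
// const crawler = new PlaywrightCrawler({
//     maxRequestsPerCrawl: 3,
//     preNavigationHooks: [blockResourcesHook],
//     async requestHandler({ page, request }) {
//         console.log(`Processing: ${request.url}`);
//         await page.screenshot({ path: 'cnn_no_images2.png' });
//     },
// });
// await crawler.run(['https://cnn.com']);
```

Matching on resource type rather than on URL patterns also catches images and videos served from URLs without a file extension. The trade-off is that enabling routing in Playwright disables the HTTP cache, so blockRequests is probably the better default when its URL patterns are enough.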
