Puppeteer code to respect the robots.txt file

It seems to be possible to crawl with Scrapy and respect robots.txt, but is there a simple way to do this in Puppeteer?

I haven't found an easy way to build "respect robots.txt" into Puppeteer commands.

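I don't believe Puppeteer has anything built in, but you can use Puppeteer to fetch robots.txt and then use any one of the npm modules that parse robots.txt to check whether you're allowed to fetch a particular URL. For example, here's how to do it with robots-txt-parser: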

const robotsParser = require('robots-txt-parser')
const robots = robotsParser()
// Now inside an async function
// (or not if using a version of Node.js that supports top-level await)
await robots.useRobotsFor('https://example.com/')
if (await robots.canCrawl(urlToVisit)) {
  // Do stuff with puppeteer here to visit the URL
} else {
  // Inform the user that sadly crawling that URL is forbidden
}
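
If you're curious what a check like canCrawl roughly has to do, here is a minimal sketch written from scratch. It is not how robots-txt-parser is implemented, and the names parseRobotsTxt and isAllowed are hypothetical helpers made up for this illustration. It assumes you already have the robots.txt contents as a string, and it only handles User-agent groups with plain Allow/Disallow path prefixes (no wildcards, no "$" anchors, no merging of repeated groups), with the longest matching prefix winning.

```javascript
// Illustrative only: a simplified robots.txt check, not a complete
// implementation of the Robots Exclusion Protocol.
// parseRobotsTxt and isAllowed are hypothetical names for this sketch.
function parseRobotsTxt(robotsTxt) {
  const groups = []; // each group: { agents: [...], rules: [...] }
  let current = null;
  let lastWasAgent = false;

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      // Consecutive User-agent lines share one group of rules.
      if (!lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === 'allow' || field === 'disallow') && current) {
      // An empty value restricts nothing, so it is skipped.
      if (value !== '') {
        current.rules.push({ allow: field === 'allow', path: value });
      }
      lastWasAgent = false;
    } else {
      lastWasAgent = false;
    }
  }
  return groups;
}

function isAllowed(robotsTxt, userAgent, path) {
  const groups = parseRobotsTxt(robotsTxt);
  const ua = userAgent.toLowerCase();

  // Prefer the first group that names this user agent; fall back to "*".
  const group =
    groups.find((g) =>
      g.agents.some((a) => a !== '' && a !== '*' && ua.includes(a))
    ) || groups.find((g) => g.agents.includes('*'));
  if (!group) return true; // no applicable group: nothing is forbidden

  // The longest matching path prefix wins; Allow wins a tie.
  let best = null;
  for (const rule of group.rules) {
    if (!path.startsWith(rule.path)) continue;
    if (
      best === null ||
      rule.path.length > best.path.length ||
      (rule.path.length === best.path.length && rule.allow)
    ) {
      best = rule;
    }
  }
  return best === null ? true : best.allow;
}
```

You would call something like isAllowed(robotsTxtText, 'MyBot', '/some/path') before navigating with Puppeteer, where 'MyBot' is a placeholder user agent. For real crawling, a maintained parser such as the one shown above is probably the safer choice, since it covers the parts of the format this sketch leaves out.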
