对具有noindex nofollow的风暴爬网程序页面进行爬网



我们使用Stormccrawler 1.13来抓取网站页面。当在一个环境中使用时,它不会对具有robots meta noindex nofollow的页面进行爬网,但当我们在另一个环境下部署相同的模块时,也会对具有noindex nofollow的页面执行爬网。下面是我们的爬网程序conf.yaml.

# Custom configuration for StormCrawler
# This is used to override the default values from crawler-default.xml and provide additional ones 
# for your custom components.
# Use this file with the parameter -conf when launching your extension of ConfigurableTopology.
# This file does not contain all the key values but only the most frequently used ones. See crawler-default.xml for an extensive list.
config: 
topology.workers: 1
topology.message.timeout.secs: 300
topology.max.spout.pending: 100
topology.debug: false
fetcher.threads.number: 50

# give 2gb to the workers
worker.heap.memory.mb: 2048
# mandatory when using Flux
topology.kryo.register:
- com.digitalpebble.stormcrawler.Metadata
# metadata to transfer to the outlinks
# used by Fetcher for redirections, sitemapparser, etc...
# these are also persisted for the parent document (see below)
# metadata.transfer:
# - customMetadataName
# lists the metadata to persist to storage
# these are not transfered to the outlinks
metadata.persist:
- _redirTo
- error.cause
- error.source
- isSitemap
- isFeed
http.agent.name: "Anonymous Coward"
http.agent.version: "1.0"
http.agent.description: "built with StormCrawler Archetype ${version}"
http.agent.url: "http://someorganization.com/"
http.agent.email: "someone@someorganization.com"
# The maximum number of bytes for returned HTTP response bodies.
# The fetched page will be trimmed to 65KB in this case
# Set -1 to disable the limit.
http.content.limit: -1
# FetcherBolt queue dump => comment out to activate
# if a file exists on the worker machine with the corresponding port number
# the FetcherBolt will log the content of its internal queues to the logs
# fetcherbolt.queue.debug.filepath: "/tmp/fetcher-dump-{port}"
parsefilters.config.file: "parsefilters.json"
urlfilters.config.file: "urlfilters.json"
# revisit a page daily (value in minutes)
# set it to -1 to never refetch a page
fetchInterval.default: 1440
# revisit a page with a fetch error after 2 hours (value in minutes)
# set it to -1 to never refetch a page
fetchInterval.fetch.error: 120
# never revisit a page with an error (or set a value in minutes)
fetchInterval.error: -1
# custom fetch interval to be used when a document has the key/value in its metadata
# and has been fetched successfully (value in minutes)
# fetchInterval.FETCH_ERROR.isFeed=true: 30
# fetchInterval.isFeed=true: 10
# configuration for the classes extending AbstractIndexerBolt
# indexer.md.filter: "someKey=aValue"
indexer.url.fieldname: "url"
indexer.text.fieldname: "content"
indexer.canonical.name: "canonical"
indexer.md.mapping:
- parse.title=title
- parse.keywords=keywords
- parse.description=description
- domain=domain
# Metrics consumers:
topology.metrics.consumer.register:
- class: "org.apache.storm.metric.LoggingMetricsConsumer"
parallelism.hint: 1

如果需要对上面的代码或风暴爬行器中的任何其他配置进行一些更改,请告诉我。

谢谢。

meta noindex的行为在1.13中是不可配置的,因此您的环境之间的任何差异都不可能是由于配置的差异。

您是如何生成拓扑的?你用原型了吗?

PS:最好设置http.agent。*配置。

最新更新