I'm using Nutch to crawl roughly 300 web pages. The crawl runs fine for about six minutes, then it gets slower and slower until throughput drops to nearly zero. I checked the logs, and the number of spinWaiting threads seems to grow over time. Can you guide me toward solving this problem?
Here is my nutch-site.xml configuration file:

<property>
<name>plugin.folders</name>
<value>/home/nutch/workspace/trunk/src/plugin</value>
</property>
<property>
<name>http.agent.name</name>
<value>nutch-test</value>
</property>
<property>
<name>parser.skip.truncated</name>
<value>false</value>
<description>Boolean value for whether we should skip parsing of truncated documents. By default this property is enabled because parsing truncated documents can sometimes consume extremely high levels of CPU.
</description>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
<description>If true, outlinks leading from a page to external hosts
will be ignored. This is an effective way to limit the crawl to include
only initially injected hosts, without creating complex URLFilters.
</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>1.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>http.max.delays</name>
<value>2</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give
up on the page for now.</description>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>0.5</value>
<description>The minimum number of seconds the fetcher will delay between
successive requests to the same server. This value is applicable ONLY
if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
is turned off).</description>
</property>
<property>
<name>fetcher.threads.per.host</name>
<value>3</value>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>100</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes, as the fetcher has one map task per node.
</description>
</property>
<property>
<name>generate.max.count</name>
<value>10000</value>
<description>The maximum number of urls in a single
fetchlist. -1 if unlimited. The urls are counted according
to the value of the parameter generator.count.mode.
</description>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>10</value>
<description>
If the Crawl-Delay in robots.txt is set to greater than this value (in
seconds) then the fetcher will skip this page, generating an error report.
If set to -1 the fetcher will never skip such pages and will wait the
amount of time retrieved from robots.txt Crawl-Delay, however long that
might be.
</description>
</property>
<property>
<name>generate.max.per.host</name>
<value>3</value>
</property>
Regards.
I think the value of generate.max.count is quite high: if a site is slow and you have 10000 URLs in a single fetch list, that can slow the whole segment down.
You should try reducing that number.
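As a sketch, a lower cap could be set in nutch-site.xml like this (the value 500 is purely illustrative, not a recommendation from the Nutch documentation; tune it to your host count and crawl size):

```xml
<!-- Illustrative value only: cap each fetch list well below 10000
     so one slow host cannot dominate an entire segment. -->
<property>
  <name>generate.max.count</name>
  <value>500</value>
  <description>The maximum number of urls in a single fetchlist.
  -1 if unlimited. Urls are counted according to generate.count.mode.
  </description>
</property>
```

With a smaller per-segment fetch list, a slow or rate-limited host blocks fewer fetcher threads at once, which should reduce the number of threads stuck in spinWaiting.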