Apache Nutch performance tuning for a whole-web crawl

I'm using Nutch to crawl about 300 web pages. The crawl works fine until roughly the 6-minute mark; after that it gets slower and slower until throughput drops to nearly zero. I checked the logs, and the number of spinWaiting threads seems to grow over time. Can you guide me toward solving this problem?

Here is my nutch-site.xml configuration file:
<configuration>
 <property>
   <name>plugin.folders</name>
   <value>/home/nutch/workspace/trunk/src/plugin</value>
 </property>
 <property>
  <name>http.agent.name</name>
  <value>nutch-test</value>
 </property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing of
  truncated documents. By default this property is enabled, because
  parsing truncated documents can sometimes consume extremely high CPU.
  </description>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
  <description>The number of seconds the fetcher will delay between 
   successive requests to the same server.</description>
</property>
<property>
  <name>http.max.delays</name>
  <value>2</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>0.5</value>
  <description>The minimum number of seconds the fetcher will delay between 
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>3</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>
<property>
  <name>generate.max.count</name>
  <value>10000</value>
  <description>The maximum number of urls in a single
  fetchlist.  -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  </description>
</property>
<property>
 <name>fetcher.max.crawl.delay</name>
 <value>10</value>
 <description>
 If the Crawl-Delay in robots.txt is set to greater than this value (in
 seconds) then the fetcher will skip this page, generating an error report.
 If set to -1 the fetcher will never skip such pages and will wait the
 amount of time retrieved from robots.txt Crawl-Delay, however long that
 might be.
 </description>
</property> 
<property>
  <name>generate.max.per.host</name>
  <value>3</value>
</property>
</configuration>

Regards.

I think the value of generate.max.count is quite high: if one of the sites is slow and a single fetchlist holds up to 10000 of its URLs, that alone can drag the whole fetch cycle down.

You should try reducing that number; a sketch of what that could look like is below.
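For illustration only, a lower cap in nutch-site.xml could look like this (the value 100 is just a guess sized to a ~300-page crawl, not something derived from your logs):

<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Illustrative value: limits each fetchlist to 100 URLs per
  host or domain (whichever generate.count.mode selects), so a single
  slow server cannot monopolize an entire fetch cycle.
  </description>
</property>

Depending on your Nutch version, you can also cap the total size of a fetchlist at generate time with -topN (the paths below are illustrative):

bin/nutch generate crawl/crawldb crawl/segments -topN 300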
