Apache Nutch performance tuning for whole-web crawling


  <description>Boolean value for whether we should skip parsing for truncated documents. This
  property is enabled by default because parsing such documents can sometimes consume extremely high levels of CPU.</description>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.</description>
  <description>The number of seconds the fetcher will delay between 
   successive requests to the same server.</description>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay seconds.  After http.max.delays attempts, it will give
  up on the page for now.</description>
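
The interaction between these two settings can be sketched as a pair of overrides inside the `<configuration>` element of `conf/nutch-site.xml`. The values below are illustrative assumptions, not recommendations, and `http.max.delays` may not be honored by every Nutch release.

```xml
<!-- Illustrative values only (assumed for this sketch). -->
<property>
  <name>fetcher.server.delay</name>
  <!-- Seconds to wait between successive requests to the same server. -->
  <value>5.0</value>
</property>
<property>
  <name>http.max.delays</name>
  <!-- Number of times a thread waits on a busy host before giving up on the page. -->
  <value>100</value>
</property>
```

With these example values a thread could wait roughly 5.0 s x 100 = 500 s on a busy host before giving up on the page for now, so lowering either value shortens the time a thread can be tied up by a single slow host.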
  <description>The minimum number of seconds the fetcher will delay between 
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.host is greater than 1 (i.e. the host blocking
  is turned off).</description>
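
As a minimal sketch of the condition described above, the fragment below raises `fetcher.threads.per.host` above 1 so that `fetcher.server.min.delay` takes effect. It is meant to sit inside `<configuration>` in `conf/nutch-site.xml`; the numbers are assumptions chosen for illustration.

```xml
<!-- Illustrative values only (assumed for this sketch). -->
<property>
  <name>fetcher.threads.per.host</name>
  <!-- More than 1 turns off host blocking, so several threads may hit one host. -->
  <value>2</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <!-- Minimum seconds between requests to the same server; only applies when the value above is greater than 1. -->
  <value>1.0</value>
</property>
```

Allowing more than one thread per host increases load on the remote server, so a non-zero minimum delay is a reasonable courtesy when host blocking is off.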
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes, as the fetcher has one map task per node.</description>
  <description>The maximum number of URLs in a single
  fetchlist.  -1 if unlimited. The URLs are counted according
  to the value of the parameter generator.count.mode.</description>
  <description>If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
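
The descriptions above do not show their property names. The sketch below pairs the remaining ones with the names I believe they carry in `nutch-default.xml` (`parser.skip.truncated`, `db.ignore.external.links`, `fetcher.threads.fetch`, `generate.max.count`, `fetcher.max.crawl.delay`). Treat both the names and the values as assumptions and check them against the `nutch-default.xml` shipped with your Nutch version before copying anything into `conf/nutch-site.xml`.

```xml
<?xml version="1.0"?>
<!-- Sketch of conf/nutch-site.xml; property names and values are assumptions to verify. -->
<configuration>
  <property>
    <name>parser.skip.truncated</name>
    <!-- Skip parsing of truncated documents to avoid CPU spikes. -->
    <value>true</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <!-- Keep the crawl on the initially injected hosts. -->
    <value>true</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- Threads per fetcher map task; each thread handles one connection. -->
    <value>50</value>
  </property>
  <property>
    <name>generate.max.count</name>
    <!-- Cap on URLs per fetchlist; a value of -1 means unlimited. -->
    <value>100</value>
  </property>
  <property>
    <name>fetcher.max.crawl.delay</name>
    <!-- Skip pages whose robots.txt Crawl-Delay exceeds this many seconds. -->
    <value>30</value>
  </property>
</configuration>
```

As a worked example of the thread arithmetic, 50 fetcher threads on a hypothetical 4-node cluster would give 50 * 4 = 200 concurrent connections in distributed mode, since the fetcher runs one map task per node.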



