Calling the controller (crawler4j-3.5) inside a loop



Hi, I'm calling the controller inside a for-loop because I have more than 100 URLs. I keep them all in a list and iterate over it to crawl each page, and I also pass each URL to setCustomData so that the crawler should not leave that domain.

for (Iterator<String> iterator = ifList.listIterator(); iterator.hasNext();) {
    String str = iterator.next();
    System.out.println("checking: " + str);
    CrawlController controller = new CrawlController(config, pageFetcher,
        robotstxtServer);
    controller.setCustomData(str);
    controller.addSeed(str);
    controller.startNonBlocking(BasicCrawler.class, numberOfCrawlers);
    controller.waitUntilFinish();
}

But when I run the code above, the first URL is crawled completely, and then as soon as the second URL starts it prints errors like the following:

50982 [main] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - Crawler 1 started.
51982 [Crawler 1] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager  - Connection request: [route: {}->http://www.connectzone.in][total kept alive: 0; route allocated: 0 of 100; total allocated: 0 of 100]
60985 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - It looks like no thread is working, waiting for 10 seconds to make sure...
70986 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
80986 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - All of the crawlers are stopped. Finishing the process...
80987 [Thread-2] INFO edu.uci.ics.crawler4j.crawler.CrawlController  - Waiting for 10 seconds before final clean up...
91050 [Thread-2] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager  - Connection manager is shutting down
91051 [Thread-2] DEBUG org.apache.http.impl.conn.PoolingClientConnectionManager  - Connection manager shut down

Please help me find a solution to the above: my intention is to start and run the controller inside a loop, because I have many URLs in a list.

**Note:** I am using crawler4j-3.5.jar and its dependencies.

Try adding all of the URLs as seeds to a single controller:

for (String url : urls) {
    controller.addSeed(url);
}

and override shouldVisit(WebURL) so that the crawler cannot leave the domain.
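
For reference, here is a minimal sketch of such an override against the crawler4j 3.5 WebCrawler API (in 3.5, shouldVisit takes only a WebURL). The class name BasicCrawler matches the question, but the allowed-domain constant is an assumption; in practice you would derive it from your URL list or from the controller's custom data.

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class BasicCrawler extends WebCrawler {

    // Assumed domain prefix for illustration; in practice, fill this from
    // your URL list (or read it via getMyController().getCustomData()).
    private static final String ALLOWED_DOMAIN = "http://www.connectzone.in/";

    @Override
    public boolean shouldVisit(WebURL url) {
        // Only follow links that stay inside the allowed domain.
        String href = url.getURL().toLowerCase();
        return href.startsWith(ALLOWED_DOMAIN);
    }

    @Override
    public void visit(Page page) {
        // Page-processing logic goes here.
        System.out.println("Visited: " + page.getWebURL().getURL());
    }
}

With every seed added to the same controller, a single start(BasicCrawler.class, numberOfCrawlers) call (or startNonBlocking(...) followed by waitUntilFinish()) replaces the per-URL loop.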
