We are using crawler4j to fetch some notices from web pages. Following the official documentation, I put together the example below:
ArticleCrawler.java
import java.util.regex.Pattern;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class ArticleCrawler extends WebCrawler
{
    private static final Logger log = LoggerFactory.getLogger(ArticleCrawler.class);

    // Note the doubled backslash: "\\." is needed so the regex sees a literal dot.
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    /**
     * This method receives two parameters. The first parameter is the page in
     * which we have discovered this new url and the second parameter is the
     * new url. You should implement this function to specify whether the
     * given url should be crawled or not (based on your crawling logic). In
     * this example, we are instructing the crawler to ignore urls that have
     * css, js, gif, ... extensions and to only accept urls that start with
     * "http://www.ics.uci.edu/". In this case, we didn't need the
     * referringPage parameter to make the decision.
     */
    @Override
    public boolean shouldVisit(Page referringPage, WebURL url)
    {
        String href = url.getURL().toLowerCase();
        return !FILTERS.matcher(href).matches() && href.startsWith("http://www.ics.uci.edu/");
    }

    /**
     * This function is called when a page is fetched and ready to be
     * processed by your program.
     */
    @Override
    public void visit(Page page)
    {
        String url = page.getWebURL().getURL();
        log.info("ArticleCrawler: visited url {}", url);
    }
}
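To sanity-check the FILTERS pattern in isolation, this small standalone snippet (not part of the crawler itself) shows which urls it rejects; it also demonstrates why the doubled backslash matters, since a single "\." would not even compile in a Java string literal:

import java.util.regex.Pattern;

public class FilterCheck
{
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public static void main(String[] args)
    {
        // Media/binary urls match the filter and are skipped by shouldVisit.
        System.out.println(FILTERS.matcher("http://www.ics.uci.edu/logo.png").matches());  // true
        // Ordinary page urls do not match and are crawled.
        System.out.println(FILTERS.matcher("http://www.ics.uci.edu/~welling/").matches()); // false
    }
}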
Controller.java
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller
{
    public static void main(String[] args) throws Exception
    {
        String crawlStorageFolder = "/";
        int numberOfCrawlers = 7;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);

        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        /*
         * For each crawl, you need to add some seed urls. These are the
         * first URLs that are fetched and then the crawler starts following
         * links which are found in these pages.
         */
        controller.addSeed("http://www.ics.uci.edu/~welling/");
        controller.addSeed("http://www.ics.uci.edu/~lopes/");
        controller.addSeed("http://www.ics.uci.edu/");

        /*
         * Start the crawl. This is a blocking operation, meaning that your
         * code will reach the line after this only when crawling is finished.
         */
        controller.start(ArticleCrawler.class, numberOfCrawlers);
    }
}
and got the following error:
ERROR [RobotstxtServer:128] 2016-04-12 17:38:59,672 - Error occurred while fetching (robots) url: http://www.ics.uci.edu/robots.txt
org.apache.http.client.ClientProtocolException
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchPage(PageFetcher.java:237)
	at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.fetchDirectives(RobotstxtServer.java:100)
	at edu.uci.ics.crawler4j.robotstxt.RobotstxtServer.allows(RobotstxtServer.java:80)
	at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:427)
	at edu.uci.ics.crawler4j.crawler.CrawlController.addSeed(CrawlController.java:381)
	at com.waijule.common.crawler.article.Controller.main(Controller.java:31)
Caused by: org.apache.http.HttpException: Unsupported cookie policy: default
	at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:150)
	at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:132)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	... 8 more
INFO [CrawlController:230] 2016-04-12 17:38:59,699 - Crawler 1 started
INFO [CrawlController:230] 2016-04-12 17:38:59,700 - Crawler 2 started
INFO [CrawlController:230] 2016-04-12 17:38:59,700 - Crawler 3 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 4 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 5 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 6 started
INFO [CrawlController:230] 2016-04-12 17:38:59,701 - Crawler 7 started
WARN [WebCrawler:412] 2016-04-12 17:38:59,864 - Unhandled exception while fetching http://www.ics.uci.edu/~welling/: null
INFO [WebCrawler:357] 2016-04-12 17:38:59,864 - Stacktrace:
org.apache.http.client.ClientProtocolException
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:186)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
	at edu.uci.ics.crawler4j.fetcher.PageFetcher.fetchPage(PageFetcher.java:237)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.processPage(WebCrawler.java:323)
	at edu.uci.ics.crawler4j.crawler.WebCrawler.run(WebCrawler.java:274)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.http.HttpException: Unsupported cookie policy: default
	at org.apache.http.client.protocol.RequestAddCookies.process(RequestAddCookies.java:150)
	at org.apache.http.protocol.ImmutableHttpProcessor.process(ImmutableHttpProcessor.java:132)
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:193)
	at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:86)
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:108)
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
	... 6 more
WARN [WebCrawler:412] 2016-04-12 17:39:00,071 - Unhandled exception while fetching http://www.ics.uci.edu/~lopes/: null
INFO [WebCrawler:357] 2016-04-12 17:39:00,071 - Stacktrace: (same ClientProtocolException / "Unsupported cookie policy: default" trace as above)
WARN [WebCrawler:412] 2016-04-12 17:39:00,273 - Unhandled exception while fetching http://www.ics.uci.edu/: null
INFO [WebCrawler:357] 2016-04-12 17:39:00,274 - Stacktrace: (same trace as above)
Also, I read the source code, but I can't quite follow the try/catch block in it. Here is the link to the source: https://github.com/yasserg/crawler4j/blob/master/src/main/java/edu/uci/ics/crawler4j/robotstxt/RobotstxtServer.java
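As far as I can tell from the log above, the robots.txt fetch is wrapped in a try/catch that only logs the failure and lets the crawl continue, which is why the ClientProtocolException shows up as a logged error rather than aborting addSeed(). A minimal standalone sketch of that catch-log-and-continue pattern (an illustration only, not crawler4j's actual RobotstxtServer code):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class RobotsFetchSketch
{
    public static void main(String[] args)
    {
        System.out.println("allowed = " + allows("http://www.ics.uci.edu/~welling/"));
    }

    static boolean allows(String pageUrl)
    {
        URL robotsUrl = null;
        try
        {
            robotsUrl = new URL(new URL(pageUrl), "/robots.txt");
            try (InputStream in = robotsUrl.openStream())
            {
                // ... parse the robots.txt directives and decide based on them ...
            }
        }
        catch (IOException e)
        {
            // A fetch failure is only logged, mirroring the
            // "Error occurred while fetching (robots) url" line in the log;
            // the exception never propagates out of the robots check.
            System.err.println("Error occurred while fetching (robots) url: " + robotsUrl + " - " + e);
        }
        // Fall back to allowing the url when robots.txt cannot be read.
        return true;
    }
}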
Thanks.
I have solved it. It was caused by version 4.2 using an obsolete cookie spec, so pin the dependency to 4.1 or lower; for now, using 4.1 is the better choice. You can find more details in this pull request: https://github.com/yasserg/crawler4j/pull/120
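For reference, pinning the dependency back in Maven would look like this (assuming you pull crawler4j from Maven Central under its edu.uci.ics group id):

<!-- Pin crawler4j to 4.1 to avoid "Unsupported cookie policy: default" -->
<dependency>
    <groupId>edu.uci.ics</groupId>
    <artifactId>crawler4j</artifactId>
    <version>4.1</version>
</dependency>

No code changes are needed after the downgrade; the ArticleCrawler and Controller above run as-is.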