crawler4j: some URLs are crawled without issue while others are not crawled at all



I have been playing around with crawler4j and have successfully crawled some pages, but failed to crawl others. For example, I have gotten it to crawl Reddit successfully with this code:

public class Controller {
    public static void main(String[] args) throws Exception {
        String crawlStorageFolder = "//home/user/Documents/Misc/Crawler/test";
        int numberOfCrawlers = 1;
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        /*
         * Instantiate the controller for this crawl.
         */
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
        /*
         * For each crawl, you need to add some seed urls. These are the first
         * URLs that are fetched and then the crawler starts following links
         * which are found in these pages
         */
        controller.addSeed("https://www.reddit.com/r/movies");
        controller.addSeed("https://www.reddit.com/r/politics");

        /*
         * Start the crawl. This is a blocking operation, meaning that your code
         * will reach the line after this only when crawling is finished.
         */
        controller.start(MyCrawler.class, numberOfCrawlers);
    }

}

together with:

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
           && href.startsWith("https://www.reddit.com/");
}

in MyCrawler.java. However, when I try to crawl http://www.ratemyprofessors.com/, the program just hangs with no output and crawls nothing. I use the following code, similar to the above, in MyController.java:

controller.addSeed("http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222");
controller.addSeed("http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044");

and in MyCrawler.java:

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    return !FILTERS.matcher(href).matches()
           && href.startsWith("http://www.ratemyprofessors.com/");
}
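The FILTERS pattern referenced in shouldVisit is not shown in the question. In the crawler4j example crawlers it is typically an extension blacklist along the following lines (a sketch; the actual pattern in the question's code may differ):

```java
import java.util.regex.Pattern;

public class FiltersExample {
    // Typical extension blacklist from crawler4j's example crawlers:
    // skip URLs that point at static assets rather than pages.
    static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|gif|jpg|png|mp3|mp4|zip|gz))$");

    public static void main(String[] args) {
        // A page URL is not filtered out...
        System.out.println(FILTERS.matcher("https://www.reddit.com/r/movies").matches()); // false
        // ...but a static asset is.
        System.out.println(FILTERS.matcher("https://www.reddit.com/logo.png").matches()); // true
    }
}
```

Note that this pattern only rejects static assets; it would not block either of the seed URLs above.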

So I am wondering:

  • Can some servers recognize crawlers right away and refuse to let them collect data?
  • I noticed that the RateMyProfessors pages are .jsp; could that have something to do with it?
  • Is there any way I can debug this better? The console doesn't print anything.

crawler4j respects crawler politeness, such as robots.txt. In your case, that file is http://www.ratemyprofessors.com/robots.txt.

Inspecting this file shows that crawling your given seed pages is disallowed:

 Disallow: /ShowRatings.jsp 
 Disallow: /campusRatings.jsp 

This theory is supported by crawler4j's log output:

2015-12-15 19:47:18,791 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222
2015-12-15 19:47:18,793 WARN  [main] CrawlController (430): Robots.txt does not allow this seed: http://www.ratemyprofessors.com/ShowRatings.jsp?tid=136044
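The effect of those Disallow rules can be illustrated with a self-contained sketch of prefix matching, which is essentially what a robots.txt check does for rules like these (simplified: real parsers, including crawler4j's RobotstxtServer, also handle Allow lines, wildcards, and per-user-agent groups):

```java
import java.net.URI;
import java.util.List;

public class RobotsCheck {
    // Simplified Disallow matching: a URL is blocked if its path starts
    // with any disallowed prefix. Query strings are not part of the path,
    // so "?sid=1222" does not help a disallowed seed.
    static boolean allowed(String url, List<String> disallowPrefixes) {
        String path = URI.create(url).getPath();
        return disallowPrefixes.stream().noneMatch(path::startsWith);
    }

    public static void main(String[] args) {
        List<String> rules = List.of("/ShowRatings.jsp", "/campusRatings.jsp");
        System.out.println(allowed(
                "http://www.ratemyprofessors.com/campusRatings.jsp?sid=1222", rules)); // false
        System.out.println(allowed(
                "http://www.ratemyprofessors.com/index.jsp", rules)); // true
    }
}
```

If you accept the politeness implications, crawler4j's RobotstxtConfig appears to expose a setEnabled flag that can turn this check off entirely (e.g. robotstxtConfig.setEnabled(false) before constructing the RobotstxtServer); verify against the version you use.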

I had a similar problem, and the error message I got was:

2017-01-18 14:18:21,136 WARN  [Crawler 1] e.u.i.c.c.WebCrawler [:412] Unhandled exception while fetching http://people.com/: people.com:80 failed to respond
2017-01-18 14:18:21,140 INFO  [Crawler 1] e.u.i.c.c.WebCrawler [:357] Stacktrace: org.apache.http.NoHttpResponseException: people.com:80 failed to respond

But I am sure people.com responds to a browser.
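A NoHttpResponseException means the server dropped the connection without answering, which some sites do for requests that do not look like a browser; setting a browser-like user agent (crawler4j's CrawlConfig has a setUserAgentString method for this) and retrying transient failures are common mitigations. A minimal retry sketch, using a hypothetical fetch callable rather than crawler4j's internals:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

public class RetryFetch {
    // Retries a fetch a fixed number of times when the server drops the
    // connection (surfacing as an IOException such as NoHttpResponseException).
    static <T> T fetchWithRetry(Callable<T> fetch, int attempts) throws Exception {
        if (attempts < 1) throw new IllegalArgumentException("attempts must be >= 1");
        IOException last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return fetch.call();
            } catch (IOException e) {
                last = e; // treat as transient and try again
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical fetch that fails twice before succeeding.
        int[] calls = {0};
        String body = fetchWithRetry(() -> {
            if (calls[0]++ < 2) throw new IOException("failed to respond");
            return "<html>ok</html>";
        }, 3);
        System.out.println(body); // prints <html>ok</html>
    }
}
```

Whether retrying helps here depends on why the site drops the connection; if it is deliberate bot blocking, only a more browser-like request (headers, pacing) will change the outcome.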
