Need clarification on Crawler4j's shouldVisit and visit methods

I need to use Crawler4j to download PDFs from a website. I followed the documentation and created two classes:

PDFCrawler
PDFCrawlController

In my PDFCrawler class, I have a shouldVisit(Page page, WebURL url) method that looks like this:

public boolean shouldVisit(Page page, WebURL url) {
    String href = url.getURL().toLowerCase(); 
    return href.startsWith(crawlDomain) && pdfPatterns.matcher(href).matches();
}

Here, crawlDomain is the domain passed in from the PDFCrawlController class (for example, http://www.example.com). pdfPatterns is defined as follows:

private static final Pattern pdfPatterns = Pattern.compile(".*(\\.(pdf?))$");

The visit(Page page) method in the PDFCrawler class begins like this:

public void visit(Page page) {
    String url = page.getWebURL().getURL();
    if (!pdfPatterns.matcher(url).matches()) {
        System.out.println("I am in " + url);
        System.out.println("No match. Leaving.");
        return;
    }
    // and so on...
}

Now, when I pass http://www.example.com to PDFCrawler, the System.out.println() calls in visit(Page page) print the following:

I am in http://www.example.com/allforgood
No match. Leaving.
I am in http://www.another-web-site.iastate.edu/grants/xp2011-02
No match. Leaving.
I am in http://www.example.com/careers
No match. Leaving.
I am in http://www.example.com/wp-content/uploads/2014/01/image-happenings1.png
No match. Leaving.

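My questions are:

Why does the crawler go to another-web-site? Am I not preventing that in the shouldVisit() method?
Why does it visit pages on the same domain that are actually images (for example, png files)? Am I not preventing that in the shouldVisit() method?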

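Your shouldVisit function is not being called. Its declaration does not match the one the newer version expects. You are following the example, but the example is wrong.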
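One way to convince yourself that the filter is simply never consulted is to run its logic in isolation. The sketch below copies the body of shouldVisit() from the question into a hypothetical helper, wouldVisit, and feeds it URLs from the log. The class name, the helper name, and the report.pdf URL are made up for illustration and are not part of Crawler4j.

```java
import java.util.regex.Pattern;

public class PdfFilterCheck {
    // Same pattern as in the question; note that "pdf?" makes the final "f" optional.
    private static final Pattern PDF_PATTERNS = Pattern.compile(".*(\\.(pdf?))$");

    // Hypothetical helper mirroring the body of shouldVisit() from the question.
    static boolean wouldVisit(String crawlDomain, String rawUrl) {
        String href = rawUrl.toLowerCase();
        return href.startsWith(crawlDomain) && PDF_PATTERNS.matcher(href).matches();
    }

    public static void main(String[] args) {
        String crawlDomain = "http://www.example.com";
        String[] urls = {
            "http://www.example.com/allforgood",
            "http://www.another-web-site.iastate.edu/grants/xp2011-02",
            "http://www.example.com/wp-content/uploads/2014/01/image-happenings1.png",
            "http://www.example.com/docs/report.pdf" // hypothetical PDF URL, not from the question
        };
        for (String url : urls) {
            System.out.println(wouldVisit(crawlDomain, url) + "  " + url);
        }
    }
}
```

This prints false for the first three URLs and true only for the PDF. So if the crawler had applied the filter, the off-domain page and the png would have been rejected, which points at the method not being invoked rather than at the filter logic.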

shouldVisit takes a single parameter, the URL. You can see this in the API documentation.

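Also, if you use the @Override annotation, you can catch cases like this: the Java compiler will tell you that you are not actually overriding the method you meant to.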
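The difference between overloading and overriding can be shown with a minimal, library-free sketch. BaseCrawler below is a hypothetical stand-in for the library's crawler base class, assumed to have a one-argument shouldVisit that accepts everything by default. Plain String and Object replace the library's URL and page types to keep the example self-contained.

```java
public class OverrideDemo {
    // Hypothetical stand-in for the library's base crawler class.
    static class BaseCrawler {
        public boolean shouldVisit(String url) {
            return true; // assumed default: visit everything
        }
    }

    // Mis-declared: the extra parameter makes this a new overload,
    // so code that calls shouldVisit(url) never reaches it.
    static class OverloadingCrawler extends BaseCrawler {
        public boolean shouldVisit(Object referringPage, String url) {
            return url.endsWith(".pdf");
        }
    }

    // Correctly declared: same signature as the base method.
    // Adding @Override to the two-argument version above would not compile.
    static class OverridingCrawler extends BaseCrawler {
        @Override
        public boolean shouldVisit(String url) {
            return url.endsWith(".pdf");
        }
    }

    // Mimics a framework that only knows the base-class signature.
    static boolean frameworkAsks(BaseCrawler crawler, String url) {
        return crawler.shouldVisit(url);
    }

    public static void main(String[] args) {
        String png = "http://www.example.com/image.png";
        System.out.println(frameworkAsks(new OverloadingCrawler(), png)); // true: base default is used
        System.out.println(frameworkAsks(new OverridingCrawler(), png));  // false: the filter runs
    }
}
```

With the overload, the call falls through to the permissive default, which would explain why both the off-domain page and the png reached visit().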
