Crawling HTTPS pages with crawler4j



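For several months we have been crawling websites with crawler4j. Suddenly, since last Friday, we can no longer crawl the same HTTPS website. Has something changed in the HTTPS protocol? The website is https://enot.publicprocurement.be/enot-war/home.do

As a test, just try to fetch the page title: "welkom op het platform e-notification"

Any help is much appreciated.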

I found that it works best when the CrawlConfig is set up like this:
 CrawlConfig config = new CrawlConfig();
 config.setIncludeHttpsPages(true);
 config.setUserAgentString("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36");
 PageFetcher pageFetcher = new PageFetcher(config);
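To check whether the User-Agent header alone makes a difference, independently of crawler4j, a plain JDK request can help. The following is only a minimal sketch: the class name UserAgentProbe and the helper prepare are hypothetical, the timeouts are arbitrary, and it says nothing about what the server in the question actually requires.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

class UserAgentProbe {
    // Same browser-like User-Agent string as in the CrawlConfig snippet above.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/41.0.2228.0 Safari/537.36";

    // Prepares a GET request with the given User-Agent; nothing is sent yet.
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestMethod("GET");
            conn.setRequestProperty("User-Agent", userAgent);
            conn.setConnectTimeout(30_000); // arbitrary 30 s connect timeout
            conn.setReadTimeout(60_000);    // arbitrary 60 s read timeout
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection conn =
                prepare("https://enot.publicprocurement.be/enot-war/home.do", USER_AGENT);
        // getResponseCode() is the call that actually opens the connection.
        System.out.println("HTTP status: " + conn.getResponseCode());
        conn.disconnect();
    }
}
```

If this request fails with an SSL exception rather than an HTTP error status, the User-Agent is probably not the cause and the TLS setup is worth a closer look.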

I had the same problem. To solve it, we need a custom PageFetcher. You can find a sample here: http://code.google.com/p/crawler4j/issues/detail?id=174

You can use this PageFetcher subclass in place of PageFetcher. It solved all the problems for me.

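import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.config.Registry;
import org.apache.http.config.RegistryBuilder;
import org.apache.http.conn.socket.ConnectionSocketFactory;
import org.apache.http.conn.socket.PlainConnectionSocketFactory;
import org.apache.http.conn.ssl.NoopHostnameVerifier;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.ssl.SSLContextBuilder;
import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;

public class PageFetcher2 extends PageFetcher {
    public static final String DEFAULT_USER_AGENT = "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:45.0) Gecko/20100101 Firefox/45.0";
    public static final RequestConfig DEFAULT_REQUEST_CONFIG = RequestConfig.custom()
            .setConnectTimeout(30 * 1000)
            .setSocketTimeout(60 * 1000)
            .build();

    public PageFetcher2(CrawlConfig config) throws KeyManagementException, NoSuchAlgorithmException, KeyStoreException {
        super(config);
        // WARNING: this trust strategy accepts every certificate, and hostname
        // verification is disabled below, so TLS no longer authenticates the server.
        SSLContext sslContext = new SSLContextBuilder()
                .loadTrustMaterial(null, (certificate, authType) -> true)
                .build();
        // HttpClientBuilder ignores setSSLContext/setSSLHostnameVerifier once a
        // connection manager is supplied, so the TLS settings are registered on
        // the connection manager itself.
        Registry<ConnectionSocketFactory> registry = RegistryBuilder.<ConnectionSocketFactory>create()
                .register("http", PlainConnectionSocketFactory.getSocketFactory())
                .register("https", new SSLConnectionSocketFactory(sslContext, NoopHostnameVerifier.INSTANCE))
                .build();
        PoolingHttpClientConnectionManager connectionManager = new PoolingHttpClientConnectionManager(registry);
        connectionManager.setMaxTotal(30);
        connectionManager.setDefaultMaxPerRoute(30);
        // Replace the HTTP client that the superclass constructor created.
        httpClient = HttpClients.custom()
                .setConnectionManager(connectionManager)
                .setUserAgent(DEFAULT_USER_AGENT)
                .setDefaultRequestConfig(DEFAULT_REQUEST_CONFIG)
                .build();
    }
}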
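For readers unfamiliar with what a trust-everything strategy amounts to, here is a rough JDK-only sketch of the same idea without Apache HttpClient. The names TrustAllSketch, TRUST_ALL and trustAllContext are made up for illustration, and this is not how SSLContextBuilder is implemented internally; it only shows the effect. It is suitable for diagnosis at most, since it removes all protection against man-in-the-middle attacks.

```java
import java.security.GeneralSecurityException;
import java.security.cert.X509Certificate;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;

class TrustAllSketch {
    // Trust manager that accepts any certificate chain without inspecting it.
    static final X509TrustManager TRUST_ALL = new X509TrustManager() {
        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) {
            // intentionally empty: never rejects
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) {
            // intentionally empty: never rejects
        }

        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return new X509Certificate[0];
        }
    };

    // Builds an SSLContext whose only trust manager is TRUST_ALL.
    static SSLContext trustAllContext() {
        try {
            SSLContext context = SSLContext.getInstance("TLS");
            context.init(null, new TrustManager[] { TRUST_ALL }, null);
            return context;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Could not create SSL context", e);
        }
    }

    public static void main(String[] args) {
        SSLContext context = trustAllContext();
        System.out.println("Created context for protocol: " + context.getProtocol());
    }
}
```

If the crawl only succeeds with a context like this, the underlying problem is likely a certificate the JVM does not trust, and importing that certificate into the JVM trust store would be the safer long-term fix.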
