Nutch: Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=htt



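I am running the parser checker for the URL https://www.modernfamilydental.net/ and the output is: Fetch failed with protocol status: exception(16), lastModified=0: Http code=403, url=https://www.modernfamilydental.net/

May I know what the problem is and how to resolve it? I tried changing the agent name, but that did not help. Please help me.

nutch-site.xml: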

<property>
<name>http.agent.name</name>
<value>crawlbot</value>
</property> 
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|query-(basic|site|url|lang)|indexer-csv|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)protocol-http|urlfilter-regex|parse-(html|tika|metatags|text|js|feed)|index-(basic|anchor|more|metadata)</value>
</property>
<property>
<name>db.ignore.external.links</name>
<value>true</value>
</property>
<property>
<name>db.ignore.external.links.mode</name>
<value>byDomain</value>
</property>
<property>
<name>fetcher.server.delay</name>
<value>2</value>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>0.5</value>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>400</value>
</property>
<property>
<name>fetcher.max.crawl.delay</name>
<value>10</value>
<description> If the Crawl-Delay in robots.txt is set to greater than this value (in seconds) then the fetcher will skip this page, generating an error report. If set to -1 the fetcher will never skip such pages and will wait the amount of time retrieved from robots.txt Crawl-Delay, however long that might be. </description>
</property>

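As you requested in the comments:

How to integrate proxy settings in Nutch

There are many free (for example https://www.sslproxies.org/) and paid proxy servers (you can find plenty of paid ones online) that you can easily integrate into Nutch.

Nutch (1.16) provides a number of configuration properties for integrating a proxy server.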

<property>
<name>http.proxy.host</name>
<value>ip-address</value>
<description>The proxy hostname.  If empty, no proxy is used.</description>
</property>
<property>
<name>http.proxy.port</name>
<value>proxy port</value>
<description>The proxy port.</description>
</property>
<property>
<name>http.proxy.username</name>
<value>blahblah</value>
<description>Username for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
NOTE: For NTLM authentication, do not prefix the username with the
domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
</description>
</property>
<property>
<name>http.proxy.password</name>
<value>blahblah</value>
<description>Password for proxy. This will be used by
'protocol-httpclient', if the proxy server requests basic, digest
and/or NTLM authentication. To use this, 'protocol-httpclient' must
be present in the value of 'plugin.includes' property.
</description>
</property>
<property>
<name>http.proxy.realm</name>
<value></value>
<description>Authentication realm for proxy. Do not define a value
if realm is not required or authentication should take place for any
realm. NTLM does not use the notion of realms. Specify the domain name
of NTLM authentication as the value for this property. To use this,
'protocol-httpclient' must be present in the value of
'plugin.includes' property.
</description>
</property>
<property>
<name>http.proxy.type</name>
<value>HTTP</value>
<description>
Proxy type: HTTP or SOCKS (cf. java.net.Proxy.Type).
Note: supported by protocol-okhttp.
</description>
</property>
<property>
<name>http.proxy.exception.list</name>
<value>nutch.org,abc.com</value>
<description>A comma separated list of hosts that don't use the proxy
(e.g. intranets). Example: www.apache.org</description>
</property>

If you look at the http plugin code in the Nutch lib, it is the interface plugin shared by all the HTTP protocol plugins (protocol-http, protocol-httpclient, protocol-okhttp, etc.):

org.apache.nutch.protocol.http.api.HttpBase
public void setConf(Configuration conf) {
this.conf = conf;
this.proxyHost = conf.get("http.proxy.host");
this.proxyPort = conf.getInt("http.proxy.port", 8080);
this.proxyType = Proxy.Type.valueOf(conf.get("http.proxy.type", "HTTP"));
this.proxyException = arrayToMap(conf.getStrings("http.proxy.exception.list"));
this.useProxy = (proxyHost != null && proxyHost.length() > 0);
this.timeout = conf.getInt("http.timeout", 10000);
.........................................
.........................................

As you can see from the code above, Nutch uses these configuration values when it initializes the HTTP client object.

Judging by your plugin.includes conf, you are using protocol-httpclient, so look at the configClient method in **org.apache.nutch.protocol.httpclient.Http**.

This specific code wires the proxy server into httpclient:
// HTTP proxy server details
if (useProxy) {
hostConf.setProxy(proxyHost, proxyPort);
if (proxyUsername.length() > 0) {
AuthScope proxyAuthScope = getAuthScope(this.proxyHost, this.proxyPort,
this.proxyRealm);
NTCredentials proxyCredentials = new NTCredentials(this.proxyUsername,
this.proxyPassword, Http.agentHost, this.proxyRealm);
client.getState().setProxyCredentials(proxyAuthScope, proxyCredentials);
}
}

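Nutch sets the proxy on the client object, so every request you make through httpclient is sent to the proxy server.

I suggest you increase fetcher.server.min.delay to 2 seconds, so that the server on the other end is not abused.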
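The delay suggestion above could be written as the following nutch-site.xml override. This is only a sketch: it assumes the value is given in seconds, like the fetcher.server.delay entry in the question's config. As far as I know, Nutch only applies fetcher.server.min.delay when fetcher.threads.per.queue is greater than 1, so in a default setup fetcher.server.delay may be the setting that actually takes effect.

```xml
<!-- Sketch only: raises the minimum per-server delay from 0.5 to 2 seconds -->
<property>
<name>fetcher.server.min.delay</name>
<value>2.0</value>
</property>
```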

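For testing purposes, you can use this tutorial.

This was a problem with http.agent.version: they were blocking it. After I changed the agent version, the problem was solved.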
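The fix described above amounts to overriding http.agent.version in nutch-site.xml. A minimal sketch follows; the post does not say which version string ended up working, so the value `1.0` is purely a hypothetical placeholder.

```xml
<!-- Hypothetical value: the post does not state the version string that worked -->
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
```

Nutch builds the User-Agent header it sends from the http.agent.* properties, so changing this value changes what the remote server sees. A 403 that disappears after such a change suggests the server was filtering on the User-Agent string.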
