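I configured nutch-site.xml for a local crawl using the interactive Selenium plugin (protocol-interactiveselenium).
I only set the basics, so the configuration is very simple. These are the properties in conf/nutch-site.xml: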
<property>
<name>plugin.includes</name>
<value>protocol-interactiveselenium|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
By default Nutch includes plugins to crawl HTML and various other
document formats via HTTP/HTTPS and indexing the crawled content
into Solr. More plugins are available to support more indexing
backends, to fetch ftp:// and file:// URLs, for focused crawling,
and many other use cases.
</description>
</property>
<property>
<name>selenium.driver</name>
<value>chrome</value>
<description>
A String value representing the flavour of Selenium
WebDriver() to use. Currently the following options
exist - 'firefox', 'chrome', 'safari', 'opera' and 'remote'.
If 'remote' is used it is essential to also set correct properties for
'selenium.hub.port', 'selenium.hub.path', 'selenium.hub.host',
'selenium.hub.protocol', 'selenium.grid.driver', 'selenium.grid.binary'
and 'selenium.enable.headless'.
</description>
</property>
<property>
<name>webdriver.chrome.driver</name>
<value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
<description>The path to the ChromeDriver binary</description>
</property>
This is from the Nutch log:
2020-08-17 23:40:57,427 ERROR interactiveselenium.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:153)
at org.apache.nutch.protocol.interactiveselenium.HttpResponse.readPlainContent(HttpResponse.java:401)
at org.apache.nutch.protocol.interactiveselenium.HttpResponse.<init>(HttpResponse.java:280)
at org.apache.nutch.protocol.interactiveselenium.Http.getResponse(Http.java:57)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:383)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:352)
Caused by: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
at com.google.common.base.Preconditions.checkState(Preconditions.java:585)
at org.openqa.selenium.remote.service.DriverService.checkExecutable(DriverService.java:146)
at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:141)
at org.openqa.selenium.chrome.ChromeDriverService.access$000(ChromeDriverService.java:35)
at org.openqa.selenium.chrome.ChromeDriverService$Builder.findDefaultExecutable(ChromeDriverService.java:159)
at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:355)
at org.openqa.selenium.chrome.ChromeDriverService.createDefaultService(ChromeDriverService.java:94)
at org.openqa.selenium.chrome.ChromeDriver.<init>(ChromeDriver.java:157)
at org.apache.nutch.protocol.selenium.HttpWebClient.createChromeWebDriver(HttpWebClient.java:182)
at org.apache.nutch.protocol.selenium.HttpWebClient.getDriverForPage(HttpWebClient.java:89)
... 5 more
2020-08-17 23:40:57,430 INFO fetcher.FetcherThread - FetcherThread 46 fetch of https://www.amazon.in/ failed with: java.lang.RuntimeException: java.lang.IllegalStateException: The driver executable does not exist: /root/chromedriver
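Why is it looking in the wrong place?
It does pick up the other settings in nutch-site.xml correctly. As soon as I included the interactiveselenium protocol, it started fetching with Selenium.
Also, earlier it was looking for /root/geckodriver, which is the Firefox driver. Once I changed selenium.driver to chrome, it started looking for /root/chromedriver.
So far so good. Now I have changed the webdriver.chrome.driver property, but it does not seem to be taken into account.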
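Looking at the code of HttpWebClient, the property webdriver.chrome.driver is overwritten by the value of selenium.grid.binary. Pointing the latter to your chromedriver should work.
Please open an issue at https://issues.apache.org/jira/projects/NUTCH. It is not clear whether this is a bug or a documentation problem, but it should be addressed either way.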
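Following the suggestion above, a minimal sketch of the extra property in conf/nutch-site.xml could look like this. It reuses the chromedriver path from the question. The description text is my own wording, not copied from nutch-default.xml, and whether this resolves the error may depend on the Nutch version in use.

```xml
<!-- Sketch only: per the answer above, HttpWebClient appears to take the
     driver path from selenium.grid.binary, not from webdriver.chrome.driver. -->
<property>
  <name>selenium.grid.binary</name>
  <value>/Users/theo/DISKS/Work/PNR/chromedriver</value>
  <description>Path to the chromedriver executable (same path as in the
  question; adjust for your machine).</description>
</property>
```

If the setting is picked up, the fetcher log should stop referring to /root/chromedriver on the next run.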