I am trying to build a web crawler with Selenium. My program throws a StaleElementReferenceException. I think this happens because I crawl the pages recursively, and when a page has no more links, the function navigates on to the next page instead of back to the parent page.
That is why I introduced a tree data structure: when the current URL is not equal to the parent URL, the crawler navigates back to the parent. But that did not solve my problem.
Can someone help me?
Code:
public class crawler {

    private static FirefoxDriver driver;
    private static String main_url = "https://robhammond.co/tools/seo-crawler";
    private static List<String> uniqueLinks = new ArrayList<String>();

    public static void main(String[] args) {
        driver = new FirefoxDriver();
        Node<String> root = new Node<>(main_url);
        scrape(root, main_url);
    }

    public static void scrape(Node<String> node, String url) {
        if (node.getParent() != null && (!driver.getCurrentUrl().equals(node.getParent().getData()))) {
            driver.navigate().to(node.getParent().getData());
        }
        driver.navigate().to(url);
        List<WebElement> allLinks = driver.findElements(By.tagName("a"));
        for (WebElement link : allLinks) {
            if (link.getAttribute("href").contains(main_url) && !uniqueLinks.contains(link.getAttribute("href")) && link.isDisplayed()) {
                uniqueLinks.add(link.getAttribute("href"));
                System.out.println(link.getAttribute("href"));
                scrape(new Node<>(link.getAttribute("href")), link.getAttribute("href"));
            }
        }
    }
}
This is the console output:
D:\Programme\openjdk-12.0.1_windows-x64_bin\jdk-12.0.1\bin\java.exe "-javaagent:D:\Programme\JetBrains\IntelliJ IDEA 2019.1.2\lib\idea_rt.jar=60461:D:\Programme\JetBrains\IntelliJ IDEA 2019.1.2\bin" -Dfile.encoding=UTF-8 -classpath C:\Users\admin\Desktop\SeleniumWebScraper\out\production\SeleniumWebScraper;D:\Downloads\selenium-server-standalone-3.141.59.jar de.company.crawler.crawler
1557924446770 mozrunner::runner INFO Running command: "C:\Program Files\Mozilla Firefox\firefox.exe" "-marionette" "-foreground" "-no-remote" "-profile" "C:\Users\admin\AppData\Local\Temp\rust_mozprofile.YqmEqE8y1pjv"
1557924447037 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: mozillaAddons
1557924447037 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: resource://pdf.js/
1557924447037 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
1557924448047 Marionette INFO Listening on port 60468
1557924448383 Marionette WARN TLS certificate errors will be ignored for this session
Mai 15, 2019 2:47:28 NACHM. org.openqa.selenium.remote.ProtocolHandshake createSession
INFO: Detected dialect: W3C
JavaScript warning: https://robhammond.co/js/jquery.min.js, line 4: Using //@ to indicate sourceMappingURL pragmas is deprecated. Use //# instead
https://robhammond.co/tools/seo-crawler#content
https://twitter.com/intent/tweet?text=SEO%20Crawler&url=https://robhammond.co/tools/seo-crawler&via=robhammond
Exception in thread "main" org.openqa.selenium.StaleElementReferenceException: The element reference of <a href="/tools/"> is stale; either the element is no longer attached to the DOM, it is not in the current frame context, or the document has been refreshed
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/stale_element_reference.html
Build info: version: '3.141.59', revision: 'e82be7d358', time: '2018-11-14T08:25:53'
System info: host: 'DESKTOP-admin', ip: '192.168.233.1', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '12.0.1'
Driver info: org.openqa.selenium.firefox.FirefoxDriver
Capabilities {acceptInsecureCerts: true, browserName: firefox, browserVersion: 66.0.5, javascriptEnabled: true, moz:accessibilityChecks: false, moz:geckodriverVersion: 0.24.0, moz:headless: false, moz:processID: 19124, moz:profile: C:\Users\admin\AppData\Loca..., moz:shutdownTimeout: 60000, moz:useNonSpecCompliantPointerOrigin: false, moz:webdriverClick: true, pageLoadStrategy: normal, platform: WINDOWS, platformName: WINDOWS, platformVersion: 10.0, rotatable: false, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify}
Session ID: b3b87675-57c8-4b48-9a20-8df5e4d37503
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481)
at org.openqa.selenium.remote.http.W3CHttpResponseCodec.createException(W3CHttpResponseCodec.java:187)
at org.openqa.selenium.remote.http.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:122)
at org.openqa.selenium.remote.http.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:49)
at org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:158)
at org.openqa.selenium.remote.service.DriverCommandExecutor.execute(DriverCommandExecutor.java:83)
at org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:552)
at org.openqa.selenium.remote.RemoteWebElement.execute(RemoteWebElement.java:285)
at org.openqa.selenium.remote.RemoteWebElement.getAttribute(RemoteWebElement.java:134)
at de.company.crawler.crawler.scrape(crawler.java:33)
at de.company.crawler.crawler.scrape(crawler.java:38)
at de.company.crawler.crawler.main(crawler.java:20)
Process finished with exit code 1
- When you navigate away from the first page, all the WebElements in the allLinks list become stale. I would suggest converting it from a list of WebElements to a list of href strings, e.g.:
  List<String> allLinksHrefs = allLinks.stream().map(link -> link.getAttribute("href")).collect(Collectors.toList());
  and then iterating over this new allLinksHrefs list.
- You could hold uniqueLinks in a hash-based collection such as a HashSet; that way duplicates are eliminated automatically.
- The current approach may take days to finish; consider using Selenium Grid and running the scrapers in parallel.