Web scraping with rvest: login problem



I want to scrape data from the following website, which requires me to log in first: http://nationallizenzen-1zbw-1eu-137hhmrga01e8.zugang.nationallizenzen.de/zbwhtml/10836/55627/Country%20Report%20Austria%20December%202017.html?page=full.

I tried to simulate the login by following the advice in this tutorial entry: https://riptutorial.com/r/example/23955/using-rvest-when-login-is-required

My code is as follows:

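library(rvest)

# Credentials (placeholders)
user.name <- "ABC12345"
pw <- "abcd1234"
url.html <- "http://nationallizenzen-1zbw-1eu-137hhmrga01e8.zugang.nationallizenzen.de/zbwhtml/10836/55627/Country%20Report%20Austria%20December%202017.html?page=full"

# Start a simulated browser session at the target URL
# (html_session() is called session() in rvest >= 1.0)
pgsession <- html_session(url.html)

# Take the first form on the page and fill in the login fields
# (set_values() is called html_form_set() in rvest >= 1.0)
pgform <- html_form(pgsession)[[1]]
filled_form <- set_values(
  pgform,
  username = user.name,
  password = pw
)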

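When I enter the URL above in the address bar, I am redirected to the following page: https://login.nationallizenzen.de/idp/profile/SAML2/Redirect/SSO?execution=e1s2.

From the HTML source of that page I gathered that the login fields to fill in are called "username" and "password".

The object pgsession is a deeply nested list. I searched through it extensively, but it does not seem to resemble the HTML source of my target page at all, and I could not find either of these fields in it.

Since I am completely new to HTML, I do not really understand what is happening here. My suspicion is that my web browser is redirected to another page and that R's simulated browser session does not reproduce this redirect. In particular, when I type pgform[["url"]], the URL does not match the one stated above: it ends in "e1s1" instead of "e1s2".

I would be very grateful for any solutions, hints, or suggestions. Given how little I know about HTML, I feel a bit lost here.

Best regards, Tobias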

I was able to fill in the form and press the submit button with the following code:

library(RSelenium)

# Start a Selenium server with Firefox in Docker (shell() is Windows-only; use system() elsewhere)
shell('docker run -d -p 4445:4444 selenium/standalone-firefox')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "firefox")
remDr$open()

# Navigate to the target page; the browser ends up on the login form
remDr$navigate("http://nationallizenzen-1zbw-1eu-137hhmrga01e8.zugang.nationallizenzen.de/zbwhtml/10836/55627/Country%20Report%20Austria%20December%202017.html?page=full")

# Fill in the user name
web_Obj_Username <- remDr$findElement("id", "username")
web_Obj_Username$sendKeysToElement(list("my_user_name"))
remDr$screenshot(display = TRUE, useViewer = TRUE)

# Fill in the password
web_Obj_Password <- remDr$findElement("id", "password")
web_Obj_Password$sendKeysToElement(list("my_password"))
remDr$screenshot(display = TRUE, useViewer = TRUE)

# Click the submit button
web_Obj_Button <- remDr$findElement("xpath", "/html/body/div/div/div/div[1]/form/div[5]/button")
web_Obj_Button$clickElement()
remDr$screenshot(display = TRUE, useViewer = TRUE)
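
Once the login has gone through, the rendered HTML can presumably be pulled out of the browser and handed to rvest for the actual scraping. The following is only a minimal sketch: the `page_source` string is a made-up stand-in so that the snippet runs without a Selenium server, and the `p` selector is purely illustrative, since the structure of the real report page is not known here. With a live session, `page_source` would come from `remDr$getPageSource()[[1]]`.

```r
library(rvest)

# Assumption: with a live session this would be
#   page_source <- remDr$getPageSource()[[1]]
# The string below is a hypothetical placeholder, not the real page.
page_source <- "<html><body><h1>Country Report</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>"

# Parse the HTML string and extract the text of all <p> elements
page <- read_html(page_source)
paragraphs <- html_text(html_nodes(page, "p"))

print(paragraphs)
```

The selector would need to be adapted to whatever elements actually hold the data on the target page.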

Here is another approach:

library(RDCOMClient)

# Drive Internet Explorer through COM (Windows only)
url <- "http://nationallizenzen-1zbw-1eu-137hhmrga01e8.zugang.nationallizenzen.de/zbwhtml/10836/55627/Country%20Report%20Austria%20December%202017.html?page=full"
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
IEApp$Navigate(url)

# Fill in the user name
web_Obj_Username <- IEApp$Document()$getElementByID("username")
web_Obj_Username$Click()
web_Obj_Username$Focus()
web_Obj_Username[["Value"]] <- "my_user_name"

# Fill in the password
web_Obj_Password <- IEApp$Document()$getElementByID("password")
web_Obj_Password$Click()
web_Obj_Password$Focus()
web_Obj_Password[["Value"]] <- "my_password"

# Build a click event and dispatch it on the login button
doc <- IEApp$Document()
clickEvent <- doc$createEvent("MouseEvent")
clickEvent$initEvent("click", TRUE, FALSE)
web_Obj_Login <- doc$getElementsByClassName("form-element form-button")
web_Obj_Login$Item(0)$dispatchEvent(clickEvent)
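
Coming back to the rvest attempt in the question: rather than digging through the nested session object, it may be easier to list the field names of the parsed form directly and check whether "username" and "password" are among them. This is a minimal sketch under stated assumptions: the HTML string is a made-up login page (the real page's markup may differ), and `html_form_set()` requires rvest 1.0 or later (older releases call it `set_values()`).

```r
library(rvest)

# Hypothetical login page for illustration only
login_html <- '<html><body>
  <form action="/login" method="post">
    <input type="text" name="username">
    <input type="password" name="password">
    <button type="submit" name="login">Log in</button>
  </form>
</body></html>'

# Parse the page and take its first form
login_form <- html_form(read_html(login_html))[[1]]

# Names of the fields rvest found in the form
field_names <- names(login_form$fields)
print(field_names)

# Fill in the credentials (placeholders); the form is not submitted here
filled_form <- html_form_set(login_form, username = "ABC12345", password = "abcd1234")
```

If the expected names are missing from `field_names`, the session has probably landed on a different page than the one seen in the browser.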
