R Selenium(或rvest):如何抓取主页中列出的子(子)页面中的表格



RSelenium

我经常需要抓取和分析医疗保健合同的公共数据,并在 VBA 中部分自动化。 尽管我昨晚花了很长时间尝试设置 RSelenium,但我应该得到一些缺点,成功地启动了服务器并运行了一些将单个表复制到数据帧的示例。 我是网络抓取的初学者。

我正在使用一个动态生成的网站。 https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false

我处理三个级别的页面:

级别 1

我的首页具有以下结构(A列包含链接,底部有页面):

========
A, B, C
link_A,15,10
link_B,23,12
link_c,21,12
link_D,32,12
========
1,2,3,4,5,6,7,8,9,...
======================

我刚刚学习了选择器小工具,它指示:

桌子

.table-striped

1.2.3.4.5.6.7

.pagination-container

级别 2在表格中的每个链接(link_A、link_B)下都有一个包含表格的子页面。示例:https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009

============
F, G, H
link_agreements,34,23
link_agreements,23,23
link_agreements,24,24
============

选择器小工具指示

.table-striped

级别 3同样,在每个链接(link_agreements)下还有另一个子页面,其中包含我要收集的数据 https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=20799&OW=15&OrthopedicSupply=False&Code=150000009&AgreementTechnicalCode=761176

============
X,Y,Z
orthopedics, 231,323
traumatology, 323,248
hematology, 323,122

同样,选择器小工具指示

.table-striped

我想迭代地将所有子页面收集到数据框中,如下所示:

来自首页的信息;来自子子页面的信息

link_A (from top page);15 (Value from A column), ortopedics, 231,323
link_A (from top page);15 (Value from A column), traumatology,323,248
link_A (from top page);15 (Value from A column), traumatology,323,122

有没有一本食谱,一些关于R selenium或rvest的好例子来展示,如何遍历表中的链接并将子(子)页面中的数据获取到数据帧中? 我将不胜感激任何信息,示例,任何提示一本书,说明如何使用RSelenium或任何其他抓取包进行操作。

附言警告:我也遇到了此页面的SSL无效cretificate问题,我正在使用Firefox硒驱动程序。所以每次我手动需要跳过警告 - 对于另一个主题。

附言到目前为止,我尝试的代码发现是死胡同。

install.packages("RSelenium")
install.packages("wdman")
library(RSelenium)   

图书馆(WDMAN) 库(XML)

接下来我开始硒,我立即遇到了"java 8 存在,java 7 需要通过删除所有 java 来解决的问题?"的问题。exe 文件 wrom Windows/System32 或 SysWOW64

library(wdman)
library(XML)
selServ <- selenium(verbose = TRUE) #installs selenium

selServ$process

remDr <- remoteDriver(remoteServerAddr = "localhost"
, port = 4567
, browserName = "firefox")

remDr$open(silent = F)
remDr$navigate("https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId=17480&OW=13&OrthopedicSupply=False&Code=130000111&AgreementTechnicalCode=773979")

webElem <- remDr$findElement(using = "class name", value = "table-striped")

webElemtxt <- webElem$getElementAttribute("outerHTML")[[1]]
table <- readHTMLTable(webElemtxt, header=FALSE, as.data.frame=TRUE,)[[1]]

webElem$clickElement()
webElem$sendKeysToElement(list(key="tab",key="enter"))

在这里,我与硒的斗争结束了。我无法将密钥发送到Chrome,无法使用Firefox,因为它需要正确的SSL证书,而我无法有效地绕过它。

table<-0
library(rvest)
# PRIMARY TABLE EXTRACTION
for (i in 1:10){
url<-paste0("https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=",i)
page<-html_session(url)
table[i]<-html_table(page)
}
library(data.table)
primary_table<-rbindlist(table,fill=TRUE)
# DATA CLEANING REQUIRED IN PRIMARY TABLE to clean the the variable 
# `Kod Sortuj według kodu świadczeniodawcy`
# Clean and store it in the primary_Table_column only then secondary table extraction will work
#SECONDARY TABLE EXTRACTION
for (i in 1:10){
url<-paste0("https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId=20795&OW=15&OrthopedicSupply=False&Code=",primary_table[i,2])
page<-html_session(url)
table[i]<-html_table(page)
# This is the key where you can identify the whose secondary table is this.
table[i][[1]][1,1]<-primary_table[i,2]
}
secondary_table<-rbindlist(table,fill=TRUE)

这是我基于 hbmstr aid 开发的答案: rvest:使用 url 而不是文本提取表 实际上,致敬是向他致敬的。我修改了他的代码来处理子页面。我也感谢巴拉特。我的代码有效,但可能非常不整洁。希望它能适应其他人。随意简化代码,提出更改。

library(rvest)
library(tidyverse)
library(stringr)
# error: Peer certificate cannot be authenticated with given CA certificates 
# https://stackoverflow.com/questions/40397932/r-peer-certificate-cannot-be-authenticated-with-given-ca-certificates-windows
library(httr)
set_config(config(ssl_verifypeer = 0L))
# Helpers
# First based on https://stackoverflow.com/questions/35947123/r-stringr-extract-number-after-specific-string
# str_extract(myStr, "(?i)(?<=ProviderID\D)\d+")
get_id <-
function (x, myString) {
require(stringr)
str_extract(x, paste0("(?i)(?<=", myString, "\D)\d+"))
}

rm_extra <- function(x) { gsub("r.*$", "", x) }
mk_gd_col_names <- function(x) {
tolower(x) %>%
gsub(" +", "_", .)
}
URL <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=15&ServiceType=03&OrthopedicSupply=False&page=%d"
get_table <- function(page_num = 1) {
pg <- read_html(httr::GET(sprintf(URL, page_num)))
tab <- html_nodes(pg, "table")
html_table(tab)[[1]][,-c(1,11)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
mutate(provider_id=get_id(link,"ProviderID"))  %>%
as_tibble()
}
pb <- progress_estimated(10)
map_df(1:10, function(i) {
pb$tick()$print()
get_table(page_num = i)
}) -> full_df
#===========level 2===============
# %26 escapes "&"
URL2a <- "https://aplikacje.nfz.gov.pl/umowy/Agreements/GetAgreements?ROK=2017&ServiceType=03&ProviderId="
URL2b <- "&OW=15&OrthopedicSupply=False&Code="
paste0(URL2a,full_df[1,11],URL2b,full_df[1,1])

get_table2 <- function(page_num = 1) {
pg <- read_html(httr::GET(paste0(URL2a,full_df[page_num,11],URL2b,full_df[page_num,1])))
tab <- html_nodes(pg, "table")
html_table(tab)[[1]][,-c(1,8)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(link = html_nodes(tab, xpath=".//td[2]/a") %>% html_attr("href")) %>%
mutate(provider_id=get_id(link,"ProviderID"))  %>%
mutate(technical_code=get_id(link,"AgreementTechnicalCode"))  %>%
as_tibble()
}
pb <- progress_estimated(nrow(full_df))
map_df(1:nrow(full_df), function(i) {
pb$tick()$print()
get_table2(page_num = i)
}) -> full_df2
#===========level 3===============
URL3a <- "https://aplikacje.nfz.gov.pl/umowy/AgreementsPlan/GetPlans?ROK=2017&ServiceType=03&ProviderId="
URL3b <- "&OW=15&OrthopedicSupply=False&Code=150000001&AgreementTechnicalCode="
paste0(URL3a,full_df2[1,8],URL3b,full_df2[1,9])
get_table3 <- function(page_num = 1) {
pg <- read_html(httr::GET(paste0(paste0(URL3a,full_df2[page_num,8],URL3b,full_df2[page_num,9]))))
tab <- html_nodes(pg, "table")
provider <- as.numeric(full_df2[page_num,8])
html_table(tab)[[1]][,-c(1,8)] %>%
set_names(rm_extra(colnames(.) %>% mk_gd_col_names)) %>%
mutate_all(funs(rm_extra)) %>%
mutate(provider_id=provider)  %>%
as_tibble()
}
pb <- progress_estimated(nrow(full_df2)+1)
map_df(1:nrow(full_df2), function(i)  {
pb$tick()$print()
get_table3(page_num = i)
} ) -> full_df3

最新更新