R - Scraping PDF files from a map



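I have been trying to download the PDFs embedded in a map using the code below (the original code can be found here). Each PDF corresponds to one Brazilian municipality (5,570 files).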

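library(XML)
library(RCurl)
url <- "http://simec.mec.gov.br/sase/sase_mapas.php?uf=RJ&tipoinfo=1"
page   <- getURL(url)
parsed <- htmlParse(page)
# collect the href of every <a> element on the page
links  <- xpathSApply(parsed, path="//a", xmlGetAttr, "href")
# keep only the links that end in .pdf
inds   <- grep("\\.pdf$", links)
links  <- links[inds]
# use the last path segment of each link as the local file name
regex_match <- regexpr("[^/]+$", links, perl=TRUE)
destination <- regmatches(links, regex_match)
for(i in seq_along(links)){
  # mode = "wb" keeps binary files intact on Windows
  download.file(links[i], destfile=destination[i], mode="wb")
  # pause 1 to 5 seconds between requests
  Sys.sleep(runif(1, 1, 5))
}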
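
The link filter and the file-name extraction can be tried offline on a handful of hrefs. This is only a sketch: the `example.com` links are made up for illustration, and it uses an escaped, anchored pattern so that only hrefs ending in `.pdf` are kept.

```r
# Made-up hrefs, for illustration only
links <- c(
  "http://example.com/files/report_a.pdf",
  "http://example.com/about.html",
  "http://example.com/files/pdf_index.html",
  "http://example.com/docs/report_b.PDF"
)

# Escape the dot and anchor at the end of the string; ignore case for ".PDF"
pdf_links <- links[grepl("\\.pdf$", links, ignore.case = TRUE)]

# The last path segment becomes the local file name
destination <- regmatches(pdf_links, regexpr("[^/]+$", pdf_links, perl = TRUE))
print(destination)
```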

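I have used this code many times in other projects and it worked. In this particular case it does not. In fact, I have tried many ways of scraping these files, but it seems impossible to me. Recently I was given the link below. The uf (state) and muncod (municipality code) parameters can be combined to download a file, but I don't know how to build that into the code.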

http://simec.mec.gov.br/sase/sase_mapas.php?uf=MT&muncod=5100102&acao=download
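
One way the two parameters might be combined is with `sprintf`. This is a minimal sketch, assuming the URL pattern above holds for every state and municipality; `build_download_url` is a hypothetical helper name, not something provided by the site or by any package.

```r
# Hypothetical helper: fill the state (uf) and municipality code (muncod)
# into the download URL pattern shown above.
build_download_url <- function(uf, muncod) {
  sprintf(
    "http://simec.mec.gov.br/sase/sase_mapas.php?uf=%s&muncod=%s&acao=download",
    uf, muncod
  )
}

# The one state/municipality pair given in the question
example_url <- build_download_url("MT", "5100102")
print(example_url)
```

Because `sprintf` is vectorised, passing a vector of municipality codes for one state would return one URL per code.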

Thanks in advance!

devtools::install_github("ropensci/RSelenium")
library(rvest)
library(httr)
library(RSelenium)
# connect to selenium server from within r (REPLACE SERVER ADDRESS)
rem_dr <- remoteDriver(
  remoteServerAddr = "192.168.50.25", port = 4445L, browserName = "firefox"
)
rem_dr$open()
# get the two-letter state codes for brazil by scraping the below webpage
tables <- "https://en.wikipedia.org/wiki/States_of_Brazil" %>%
  read_html() %>%
  html_table(fill = TRUE)
states <- tables[[4]]$Abbreviation
# for each state, we are going to navigate to the map of that state using
# selenium, then scrape the list of possible municipality codes from the drop
# down menu present in the map
get_munip_codes <- function(state) {
  url <- paste0("http://simec.mec.gov.br/sase/sase_mapas.php?uf=", state)
  rem_dr$navigate(url)
  # have to wait until the drop down menu loads. 8 seconds will be enough time
  # for each state
  Sys.sleep(8)
  src <- rem_dr$getPageSource()
  # keep the value attribute of every <option> that has a non-empty value
  out <- read_html(src[[1]]) %>%
    html_nodes(xpath = "//select[@id='muncod']/option[boolean(@value)]") %>%
    html_attr("value")
  print(state)
  out
}
state_munip <- sapply(
  states, get_munip_codes, USE.NAMES = TRUE, simplify = FALSE
)
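# now you can download each pdf. first create a directory for each state, where
# the pdfs for that state will go (recursive = TRUE also creates "brazil-pdfs"
# itself if it does not exist yet):
lapply(names(state_munip), function(x) {
  dir.create(file.path("brazil-pdfs", x), recursive = TRUE, showWarnings = FALSE)
})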
# ...then loop over each state/municipality code and download the pdf
lapply(
  names(state_munip), function(state) {
    lapply(state_munip[[state]], function(munip) {
      url <- sprintf(
        "http://simec.mec.gov.br/sase/sase_mapas.php?uf=%s&muncod=%s&acao=download",
        state, munip
      )
      file <- file.path("brazil-pdfs", state, paste0(munip, ".pdf"))
      this_one <- paste0("state ", state, ", munip ", munip)
      tryCatch({
        GET(url, write_disk(file, overwrite = TRUE))
        print(paste0(this_one, " downloaded"))
      },
      error = function(e) {
        print(paste0("couldn't download ", this_one))
        try(unlink(file, force = TRUE))
      }
      )
    })
  }
)
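
The fixed `Sys.sleep(8)` in `get_munip_codes` waits the same time for every state, whether the drop-down loads in one second or needs more than eight. A possible alternative is to poll until a condition holds or a timeout expires. The sketch below uses only base R; `wait_until` and its arguments are hypothetical names introduced here, and the toy condition stands in for a real check of the page.

```r
# Hypothetical helper: call `condition()` repeatedly until it returns TRUE
# or `timeout` seconds have passed. Returns TRUE on success, FALSE on timeout.
wait_until <- function(condition, timeout = 30, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    if (isTRUE(condition())) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Toy condition: becomes TRUE on the third call
attempts <- 0
ready <- wait_until(function() {
  attempts <<- attempts + 1
  attempts >= 3
}, timeout = 5, interval = 0.1)
print(ready)
```

Inside `get_munip_codes`, the condition could be a function that re-reads `rem_dr$getPageSource()` and checks whether the `muncod` select already contains options. That part is left out here because it needs a running Selenium server.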

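Steps:

  1. Get the IP address of your Windows computer (see https://www.digitalcitizen.life/find-ip-address-windows)

  2. Start the Selenium server Docker container by running: docker run -d -p 4445:4444 selenium/standalone-firefox:2.53.1

  3. Start the rocker/tidyverse Docker container by running: docker run -v `pwd`/brazil-pdfs:/home/rstudio/brazil-pdfs -dp 8787:8787 rocker/tidyverse

  4. Open your preferred browser and go to this address: http://localhost:8787 ... this takes you to the login screen of RStudio Server. Log in with the username "rstudio" and the password "rstudio"

  5. Copy/paste the code shown above into a new RStudio .R document. Replace the value of remoteServerAddr with the IP address found in step 1.

  6. Run the code... this should write the PDFs to the directory "brazil-pdfs", which is both inside the container and mapped to your Windows machine (in other words, the PDFs will also show up in the brazil-pdfs directory on your local machine). Note that the code takes a while to run because there are a lot of PDFs.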
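
Once the run has finished, it may be worth checking what actually landed in "brazil-pdfs". `httr::GET` does not raise an R error for an HTTP error status, so the `tryCatch` handler above only fires on connection-level failures, and a saved file is not guaranteed to be a PDF. The sketch below is one possible check, assuming the one-directory-per-state layout created above; `is_pdf` and `summarise_downloads` are hypothetical helper names, and the demo files are placeholders written to a temporary directory.

```r
# Hypothetical helper: TRUE if the file starts with the PDF magic bytes "%PDF"
is_pdf <- function(path) {
  if (!file.exists(path)) return(FALSE)
  header <- readBin(path, what = "raw", n = 4L)
  identical(header, charToRaw("%PDF"))
}

# Hypothetical helper: one row per downloaded file, with its state directory
# and whether the content looks like a PDF
summarise_downloads <- function(root) {
  files <- list.files(
    root, pattern = "\\.pdf$", recursive = TRUE,
    full.names = TRUE, ignore.case = TRUE
  )
  data.frame(
    state = basename(dirname(files)),
    file  = basename(files),
    valid = vapply(files, is_pdf, logical(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Demo on placeholder files: one fake PDF header, one HTML error page
demo_root <- file.path(tempdir(), "brazil-pdfs-demo")
dir.create(file.path(demo_root, "MT"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(demo_root, "RJ"), recursive = TRUE, showWarnings = FALSE)
writeBin(charToRaw("%PDF-1.4\n"), file.path(demo_root, "MT", "5100102.pdf"))
writeLines("<html>error page</html>", file.path(demo_root, "RJ", "0000000.pdf"))

report <- summarise_downloads(demo_root)
print(report)
```

On the real output, `summarise_downloads("brazil-pdfs")` would give a row count to compare with the 5,570 municipalities mentioned in the question, and the rows with `valid == FALSE` would be candidates for a retry.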
