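I'm stuck and need some help.
I have a list of links (about 9,000) that I loop over, running some processing on each one.
The links look like this:
link1link2link3link4…..link9000
The trouble is that sometimes link2 fails with a timeout, and other times link2 works and link400, or some other random link, times out instead. Is there a way to retry a link when it fails? I added: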
status_c <- httr::GET(Links, config = httr::config(connecttimeout = 150))
But I still get timeouts. Please help, or is there anything you would suggest?
final_links_bind holds the list of all links. Some sample links:
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
for(i in 1:nrow(final_links_bind)) {
  Links <- final_links_bind[i,]
  BP_ID <- final_bp_bind[i,]
  #print(Links)
  status_c <- GET(Links,timeout(120))
  status <- status_code(status_c)
  if(status == "200"){
    url_parse <- read_html(Links)
    col_name <- url_parse %>%
      html_nodes("tr") %>%
      html_text()
    col_name <- stringr::str_remove_all(col_name, "\t|\n|\r")
    pattern_col_no <- grep("využití", col_name)
    col_name <- as.data.frame(col_name)
    method_selected <- col_name[pattern_col_no,]
    WRITE_CSV_DATA <- rbind(WRITE_CSV_DATA, data.frame(BP_ID = c(BP_ID), method_selected = c(method_selected), Links = c(Links)))
    #METHOD_OF_USE <- rbind(method_selected,METHOD_OF_USE)
    print(WRITE_CSV_DATA)
  }else{
    print("LINK NOT WORKING")
    no_Links <- sorted_link[i,]
    not_working_link <- rbind(not_working_link,no_Links)
  }
}
It's not clear what final output you want, but here is how to scrape the links and skip the ones that don't work.
library(rvest)
library(httr2)
library(tidyverse)
Given a data frame of links, note that the third link does not work:
df <- tibble(
  links = c(
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999",
    "https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789"
  )
)
# A tibble: 4 × 1
  links
  <chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789
Create a function that scrapes the table and pulls out the value in its third row:
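get_info <- function(link) {
  # Log progress so a long run shows which link is being fetched
  cat("Scraping", link, "\n")
  link %>%
    read_html() %>%
    html_table() %>%
    pluck(2) %>%   # second table on the page
    slice(3) %>%   # third row
    pull(2)        # second column
}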
Then mutate() a new column that holds the info, or NA if the link does not work. When a link fails, possibly() returns NA (NA_character_) instead of stopping the code.
df %>%
  mutate(
    info = map_chr(links, possibly(get_info, otherwise = NA_character_))
  )
# A tibble: 4 × 2
  links                                                 info
  <chr>                                                 <chr>
1 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146711 rodinný dům
2 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146703 rodinný dům
3 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/9999999 NA
4 https://vdp.cuzk.cz/vdp/ruian/stavebniobjekty/2146789 rodinný dům
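The question also asks about retrying a link that timed out, which skipping alone does not cover. Below is a minimal base R sketch of one way to do that. Note that `with_retry()` is a hypothetical helper written for this illustration, not a function from any package, and `flaky()` is only a stand-in for a request that times out twice before succeeding, so the example runs without network access.

```r
# Hypothetical helper (an assumption, not a package function): wraps f so that
# it is called up to `times` times, sleeping `pause` seconds after each failure.
with_retry <- function(f, times = 3, pause = 2) {
  stopifnot(times >= 1)
  function(...) {
    for (attempt in seq_len(times)) {
      # Capture an error as a value instead of letting it stop the loop
      result <- tryCatch(f(...), error = function(e) e)
      if (!inherits(result, "error")) return(result)
      message("Attempt ", attempt, " of ", times, " failed: ",
              conditionMessage(result))
      if (attempt < times) Sys.sleep(pause)
    }
    stop(result)  # every attempt failed: re-raise the last error
  }
}

# Stand-in for a flaky request: fails on the first two calls, then succeeds.
calls <- 0
flaky <- function(x) {
  calls <<- calls + 1
  if (calls < 3) stop("timeout")
  paste("ok:", x)
}

retry_flaky <- with_retry(flaky, times = 3, pause = 0)
result <- retry_flaky("link2")
print(result)  # "ok: link2", reached on the third attempt
```

To combine this with the skipping approach, one option would be `map_chr(links, possibly(with_retry(get_info), otherwise = NA_character_))`, so a link only becomes NA once all attempts have failed. Rows where `info` is NA can then be filtered out afterwards to get the list of links that never worked. If you would rather stay with httr, its `RETRY()` function offers similar behaviour for a single request, for example `httr::RETRY("GET", url, times = 3)`.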