R - How to handle HTTP error 503 when scraping, even after incorporating Sys.sleep()



I am trying to extract speech transcripts from the Congressional Record. I have written the following code to do this:

# install the following packages
library(rvest)
library(dplyr)

# write a function to extract text of speeches
get_text = function(speech_link) {
  speech_page = read_html(speech_link)
  speech_text = speech_page %>% html_nodes(".styled") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_text)
}

# create empty df
speeches = data.frame()

# for loop to extract speech heading, date, and text
for(page_result in seq(from = 1, to = 25, by = 1)) {
  link = paste0(
    "https://www.congress.gov/search?pageSort=issueAsc&q=%7B%22source%22%3A%22congrecord%22%2C%22search%22%3A%22covid%22%2C%22chamber%22%3A%22House%22%2C%22congress%22%3A%5B%22117%22%2C%22116%22%5D%7D&pageSize=100&page=",
    page_result
  )

  page = read_html(link)

  heading = page %>% html_nodes(".congressional-record-heading a") %>% html_text()
  speech_links = page %>% html_nodes(".congressional-record-heading a") %>%
    html_attr("href") %>% paste("https://www.congress.gov", ., sep = "")
  date = page %>% html_nodes(".congressional-record-heading+ .result-item span") %>% html_text()
  text = sapply(speech_links, FUN = get_text)
  Sys.sleep(5)

  speeches = rbind(speeches,
                   data.frame(heading, date, text, stringsAsFactors = FALSE))

  print(paste("Page:", page_result))
}

This worked until I got to around the tenth page, at which point I received the error below. Now I can't scrape even a single page. I think I have flooded the site.

Error in open.connection(x, "rb") : HTTP error 503.

After reading earlier posts, I incorporated Sys.sleep() into my loop, as you can see above, but it hasn't made a difference. What am I doing wrong here? Any feedback would be greatly appreciated.

Edit: here is what I did based on the comments below, but it still doesn't work:

# write a function to extract text of speeches
get_text = function(speech_link) {
  speech_page = read_html(speech_link)
  Sys.sleep(1)
  speech_text = speech_page %>% html_nodes(".styled") %>%
    html_text() %>% paste(collapse = ",")
  return(speech_text)
}

# create empty df
speeches = data.frame()

# for loop to extract speech heading, date, and text
for(page_result in seq(from = 11, to = 25, by = 1)) {
  link = paste0(
    "https://www.congress.gov/search?pageSort=issueAsc&q=%7B%22source%22%3A%22congrecord%22%2C%22search%22%3A%22covid%22%2C%22chamber%22%3A%22House%22%2C%22congress%22%3A%5B%22117%22%2C%22116%22%5D%7D&pageSize=100&page=",
    page_result
  )
  Sys.sleep(1)
  page = tryCatch(read_html(link), error = function(e){NA})

  heading = page %>% html_nodes(".congressional-record-heading a") %>% html_text()
  speech_links = page %>% html_nodes(".congressional-record-heading a") %>%
    html_attr("href") %>% paste("https://www.congress.gov", ., sep = "")
  date = page %>% html_nodes(".congressional-record-heading+ .result-item span") %>% html_text()
  text = sapply(speech_links, FUN = get_text)

  speeches = rbind(speeches,
                   data.frame(heading, date, text, stringsAsFactors = FALSE))

  print(paste("Page:", page_result))
}

In fact, I am now getting this new error:

Error in open.connection(x, "rb") : HTTP error 503.
In addition: Warning message:
In .Internal(get(x, envir, mode, inherits)) :
closing unused connection 3 (https://www.congress.gov/congressional-record/2020/12/31/house-section/article/h9169-2?q=%7B%22search%22%3A%5B%22covid%22%2C%22covid%22%5D%7D&s=1&r=1071)

I cannot reproduce your error, but I think what might help you is to save the source HTML instead of processing it on the fly. This has two main advantages over your approach:

  1. You can actually inspect the source HTML to see whether it downloaded correctly, and if it did not, look at exactly where the problem is.

  2. You don't have to ask the server for the same data ten times, which puts it under ever more strain. Since you already have part of the data on disk, you can skip the HTML files you have already saved and pick up where you stopped yesterday or a few hours ago.

library(rvest)
library(dplyr)
library(stringr)
library(httr)

# work relative to the location of this script (requires RStudio)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
getwd()
## clear potential leftovers from other scripts
rm(list = ls())

# create a directory for the index pages
if(!dir.exists('index')) dir.create('index')
folder <- 'index'

# create vector with all links
links <- str_c("https://www.congress.gov/search?pageSort=issueAsc&q=%7B%22source%22%3A%22congrecord%22%2C%22search%22%3A%22covid%22%2C%22chamber%22%3A%22House%22%2C%22congress%22%3A%5B%22117%22%2C%22116%22%5D%7D&pageSize=100&page=", 1:25)

# loop over the links and save each page to disk, skipping files that already exist
for(i in seq_along(links)){
  file <- file.path(folder, str_c('page_', i, ".html"))
  if(!file.exists(file)){
    httr::GET(links[i], httr::write_disk(file))
  }
}
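# sketch (not from the original answer): since the index pages are now on disk,
# a quick sanity check is to count the result headings in each saved file -
# a file that is actually a saved 503 error page will contain none
sapply(list.files(folder, full.names = TRUE), function(f) {
  length(html_nodes(read_html(f), ".congressional-record-heading a"))
})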
# get the source file names
index_files <- list.files(folder, full.names = TRUE)

# process and parse them
speeches <- tibble()
for (file in index_files) {
  page <- rvest::read_html(file)
  heading = page %>% html_nodes(".congressional-record-heading a") %>% html_text()
  speech_links = page %>% html_nodes(".congressional-record-heading a") %>%
    html_attr("href") %>% paste("https://www.congress.gov", ., sep = "")
  date = page %>% html_nodes(".congressional-record-heading+ .result-item span") %>% html_text()
  speeches = bind_rows(speeches,
                       data.frame(heading, date, speech_links, stringsAsFactors = FALSE))
}
# now do the same for the individual speech pages
if(!dir.exists('text')) dir.create('text')
folder <- 'text'
speeches <- speeches %>%
  mutate(id = row_number())
for(i in 1:nrow(speeches)){
  if(!file.exists(file.path(folder, str_c('text_', speeches$id[i], ".html")))){
    GET(speeches$speech_links[i],
        write_disk(file.path(folder, str_c('text_', speeches$id[i], ".html"))))
  }
}
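# sketch of an alternative if the 503s persist (not from the original answer):
# httr::RETRY() retries a failed request with exponential backoff before giving up,
# and a short Sys.sleep() between downloads keeps the request rate polite
for(i in 1:nrow(speeches)){
  out <- file.path(folder, str_c('text_', speeches$id[i], ".html"))
  if(!file.exists(out)){
    httr::RETRY("GET", speeches$speech_links[i],
                httr::write_disk(out, overwrite = TRUE),
                times = 5, pause_base = 2, pause_cap = 60)
    Sys.sleep(1)
  }
}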

text_files <- list.files(folder, full.names = T)
# now parse all the text files in a similar fashion... 
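From here, the saved speech pages can be parsed in the same way. A minimal sketch, reusing the .styled selector and the comma-collapse from your get_text() function and matching each file back to speeches via the id in its file name:

# parse the saved speech pages and join the text back onto speeches by id
speech_texts <- tibble(
  id = as.integer(str_extract(basename(text_files), "\\d+")),
  text = sapply(text_files, function(file) {
    read_html(file) %>%
      html_nodes(".styled") %>%
      html_text() %>%
      paste(collapse = ",")
  })
)
speeches <- left_join(speeches, speech_texts, by = "id")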
