rvest: ignore non-existent URLs and keep scraping

I am new to web scraping and the rvest package. What I want to do is scrape the news content from the following site: http://www.xwlbo.com/31035.html

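I noticed that the historical news pages follow a numeric index pattern, but I later found that the indices are irregular, with no clear rule. As a result, some pages in the range may not exist, and I get the error: Error in open.connection(x, "rb") : HTTP error 404. How can I skip the missing pages and keep scraping the ones that exist?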

Here is what I have come up with so far:

library(tidyverse)
library(lubridate)
library(stringr)
library(rvest)
Sys.setlocale(category="LC_ALL",locale="chinese")
web_index_number <- 4058:31106
urls <- str_c("http://www.xwlbo.com/",web_index_number,".html")

news_collect <- function(x) {
  webpage <- read_html(x)
  wp_title <- html_node(webpage, 'h2') %>%
    html_text()
  wp_content <- html_nodes(webpage, 'p , a , h2') %>%
    html_text()
  len <- length(wp_content) - 3
  wp_content <- wp_content[1:len]
  wp_title <- rep(wp_title, len)
  news <- data.frame(wp_title, wp_content)
}
news_collection <- map_df(urls, news_collect)

You can use tryCatch, wrapping the body of news_collect in it. If read_html(x) fails, the error handler can print the error and return NULL, so that page is skipped.
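
The tryCatch idea above can be sketched as follows. This is a minimal illustration rather than a drop-in fix: fake_read is a hypothetical stand-in for read_html that fails on some URLs the way a 404 would, so the example runs without network access, and safe_collect and the example.com URLs are likewise names made up for the illustration.

```r
# Hypothetical stand-in for rvest::read_html(): raises an error for
# "missing" pages, similar to what a real HTTP 404 would do.
fake_read <- function(url) {
  if (grepl("missing", url)) stop("HTTP error 404.")
  paste("content of", url)
}

# Wrap the fetch in tryCatch: on error, report the URL and return NULL
# instead of aborting the whole loop.
safe_collect <- function(url, reader = fake_read) {
  tryCatch(
    {
      page <- reader(url)
      data.frame(url = url, content = page, stringsAsFactors = FALSE)
    },
    error = function(e) {
      message("Skipping ", url, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Example URLs (assumed for illustration only)
urls <- c(
  "http://example.com/1.html",
  "http://example.com/missing.html",
  "http://example.com/3.html"
)

results <- lapply(urls, safe_collect)

# Drop the NULL entries left by failed pages, then stack the rest.
news <- do.call(rbind, Filter(Negate(is.null), results))
print(news)
```

In the real script, the same tryCatch wrapper would go around the body of news_collect, with read_html(x) in place of the stub. Since bind_rows ignores NULL elements, map_df(urls, news_collect) should then combine only the pages that loaded. purrr::possibly(news_collect, otherwise = NULL) is another way to get the same skip-on-error behavior.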
