Scraping multiple pages from IMDB with rvest

I am trying to scrape data from this IMDB link: https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv

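I want to scrape the runtime and title data with the code below. However, how can I do the same for several other pages? I tried writing a for loop, but I don't know how to incorporate it into my code. The URLs follow this pattern: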

https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv
https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=201&ref_=adv_nxt
https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=301&ref_=adv_nxt
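Only the `start` parameter changes from page to page in the pattern above (101, 201, 301), so the URLs could be generated instead of typed out. The following is a minimal sketch in base R. The names `base_url` and `starts` are made up for illustration, and it assumes that the `ref_` parameter (which is `adv_prv` in the first URL and `adv_nxt` in the others) does not affect which results are returned.

```r
# URL template with a %d placeholder for the `start` parameter.
# Assumption: the `ref_` value does not change the page contents.
base_url <- "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=%d&ref_=adv_nxt"

# Start offsets following the pattern in the question: 101, 201, 301
starts <- as.integer(seq(from = 101, to = 301, by = 100))

# sprintf() is vectorised, so this returns one URL per offset
urls <- sprintf(base_url, starts)
```

Changing the `to` argument of `seq()` extends the same pattern to further pages.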

My code:

library(rvest)

url <- 'https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv'
webpage <- read_html(url)

# Titles
titlehtml <- html_nodes(webpage, '.lister-item-header a')
title <- html_text(titlehtml)

# Runtimes: strip the " min" suffix and convert to numeric
runtimehtml <- html_nodes(webpage, '.text-muted .runtime')
runtime <- html_text(runtimehtml)
runtime <- gsub(" min", "", runtime)
runtime <- as.numeric(runtime)

Try this:

library(rvest)

urls <- c("https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=101&ref_=adv_prv",
          "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=201&ref_=adv_nxt",
          "https://www.imdb.com/search/title?release_date=2010-01-01,2017-12-31&count=100&start=301&ref_=adv_nxt")

# Collect one data frame per page
results_list <- list()
for (.page in seq_along(urls)) {
  webpage <- read_html(urls[[.page]])

  titlehtml <- html_nodes(webpage, '.lister-item-header a')
  title <- html_text(titlehtml)

  runtimehtml <- html_nodes(webpage, '.text-muted .runtime')
  runtime <- html_text(runtimehtml)
  runtime <- gsub(" min", "", runtime)

  results_list[[.page]] <- data.frame(title = title,
                                      runtime = as.numeric(runtime))
}

# Stack the per-page data frames into one
final_results <- plyr::ldply(results_list)
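If plyr is not installed, the per-page data frames can probably be stacked with base R instead. This is a minimal sketch that swaps `plyr::ldply()` for `do.call(rbind, ...)`. The two small data frames are made-up placeholders standing in for scraped pages, not real IMDB results.

```r
# Placeholder data standing in for two scraped pages (hypothetical values)
results_list <- list(
  data.frame(title = c("Film A", "Film B"), runtime = c(100, 120)),
  data.frame(title = "Film C", runtime = 95)
)

# rbind() is applied across all list elements; this works because
# every data frame has the same column names
final_results <- do.call(rbind, results_list)
```

Both approaches give one row per title. `do.call(rbind, ...)` requires that all the data frames share the same columns, which holds in the loop above.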
