r - Web scraping multiple levels of a website

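I want to scrape a website and then, for each scraped item, scrape further information from its sub-page. As an example I will use the IMDb website. I am using the rvest package together with the SelectorGadget extension in Google Chrome.

From the IMDb site I can get the 250 top-rated TV shows as follows: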

library('rvest')
# URL to be scraped
url <- 'http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2'
# Read the HTML code from the website
webpage <- read_html(url)
# Use a CSS selector to pick out the title links
movies_html <- html_nodes(webpage, '.titleColumn a')
# Convert the TV show nodes to text
movies <- html_text(movies_html)
head(movies)
[1] "Planet Earth II"  "Band of Brothers" "Planet Earth"     "Game of Thrones"  "Breaking Bad"     "The Wire"

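Each of the top 250 movies in the list is a clickable link that leads to more information about that movie. In this case, for each movie in movies, I would also like to scrape the cast and store it in another list. For example, if you click the second movie from the top, "Band of Brothers", and scroll down, the cast consists of about 40 people, from Scott Grimes to Phil McKee.

Pseudocode for what I want to do: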

for(i in movies) {
  url <- 'http://www.imdb.com/chart/toptv/i'
  webpage <- read_html(url)
  cast_html <- html_nodes(webpage,'#titleCast .itemprop')
  castList<- html_text(cast_html)
}

I am sure this is simple, but it is new to me and I do not know the right terms to search for to find a solution.

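If I understand correctly, you are looking for a way to

1. identify the URLs of the show pages from the top 250 page (main_url),
2. get the titles of the top 250 shows (m_titles),
3. visit those URLs (m_urls), and
4. extract the cast of each of those TV shows (m_cast).

Correct?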

We will first define a function that extracts the cast from a TV show's page:

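getcast <- function(url){
  # Download and parse the show's page
  page <- read_html(url)
  # Select the cast entries and convert them to text
  nodes <- html_nodes(page, '#titleCast .itemprop')
  cast <- html_text(nodes)
  # Keep only the even-indexed entries
  inds <- seq(from=2, to=length(cast), by=2)
  cast <- cast[inds]
  return(cast)
}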
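One thing to watch in getcast(): seq(from=2, to=length(cast), by=2) throws an error when the selector matches fewer than two nodes, for example if a page has no cast section or the page layout has changed. A minimal sketch of an even-index filter that tolerates that case is shown below. The helper name keep_even is hypothetical and not part of the original answer, and the sample vector is made-up data.

```r
# Hypothetical helper: keep the 2nd, 4th, 6th, ... elements of a vector.
# Unlike seq(from = 2, to = length(x), by = 2), this does not error
# when x has fewer than two elements; it returns an empty vector.
keep_even <- function(x) {
  x[seq_along(x) %% 2 == 0]
}

# Made-up data standing in for the text returned by html_text()
sample_cast <- c("", "Actor A", "", "Actor B")
keep_even(sample_cast)
keep_even(character(0))
```

Inside getcast(), the two lines that build inds and subset cast could then be replaced by cast <- keep_even(cast).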

With this, we can work through points 1 to 4:

# Open main_url and navigate to the interesting part of the page:
main_url <- "http://www.imdb.com/chart/toptv/?ref_=nv_tp_tv250_2"
main_page <- read_html(main_url)
movies_html <- html_nodes(main_page, '.titleColumn a')
# From the interesting part, get the titles and URLs:
m_titles <- html_text(movies_html)
sub_urls <- html_attr(movies_html, 'href')
m_urls <- paste0('http://www.imdb.com', sub_urls)
# Use `getcast()` to extract the cast from every URL in `m_urls`
m_cast <- lapply(m_urls, getcast)
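Running lapply() over 250 URLs means one failed request stops the whole run, and the resulting list is not labelled by show. The sketch below is one possible way around both, using only base R. The function name scrape_all, its delay argument, and the choice of NA as the failure value are assumptions for illustration, not part of the original answer.

```r
# Hypothetical wrapper: apply a scraping function to each URL,
# pause between requests, and record NA instead of stopping on errors.
scrape_all <- function(urls, titles, scrape_fn, delay = 1) {
  results <- lapply(urls, function(u) {
    Sys.sleep(delay)  # wait between requests to avoid hammering the server
    tryCatch(scrape_fn(u), error = function(e) NA_character_)
  })
  # Label each result with the matching show title
  names(results) <- titles
  results
}

# Offline demonstration with a stand-in scraper and made-up inputs
fake_scraper <- function(u) {
  if (u == "bad") stop("request failed")
  c("Actor A", "Actor B")
}
demo <- scrape_all(c("good", "bad"), c("Show 1", "Show 2"),
                   fake_scraper, delay = 0)
str(demo)
```

With the objects defined above, the call would be m_cast <- scrape_all(m_urls, m_titles, getcast), after which m_cast[["Band of Brothers"]] looks up a cast by title.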
