R - Webscraping headings and lists into a data frame using rvest



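I want to scrape the hyperlinks on this web page into a data frame with the columns shown below. The source page consists of headings, each followed by a list of links.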

  • Subject heading (the problem)
  • Hyperlink title (OK)
  • Hyperlink (OK)

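Getting the links and their titles is straightforward (html_nodes() on "li" and "a"). What I can't work out is how to get the subject headings into the final data frame.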

library(tidyverse)
library(rvest)
my.url <- read_html("http://www.secnav.navy.mil/fmc/fmb/Pages/Fiscal-Year-2019.aspx") %>% 
  html_nodes("#sharePointMainContent") 
hyperlink.title <- my.url %>% 
  html_nodes("li") %>% 
  html_text()
hyperlink <- my.url %>% 
  html_nodes("li") %>% 
  html_nodes("a") %>% 
  html_attr("href")
df <- tibble(hyperlink.title, hyperlink)

I can scrape the headings successfully, but I can't figure out how to merge them correctly into the final data frame.

subject.heading <- my.url %>% 
  html_nodes("h3") %>% 
  html_text() %>% str_trim()

Created on 2018-09-03 by the reprex package (v0.2.0).
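The difficulty described above can be reproduced on a small scale. The sketch below uses a made-up HTML fragment (the `cell` class, section names, and file names are all hypothetical stand-ins, not taken from the real page) to show that scraping headings and list items separately yields vectors of different lengths, so they cannot simply be bound together as columns.

```r
library(rvest)

# Hypothetical stand-in for the real page: two sections,
# each an <h3> heading followed by a list of links.
toy <- read_html('
  <div id="main">
    <div class="cell"><h3>SECTION A</h3>
      <ul>
        <li><a href="a1.pdf">Doc A1</a></li>
        <li><a href="a2.pdf">Doc A2</a></li>
      </ul>
    </div>
    <div class="cell"><h3>SECTION B</h3>
      <ul>
        <li><a href="b1.pdf">Doc B1</a></li>
      </ul>
    </div>
  </div>')

# Scraped page-wide, the two vectors lose track of which
# list item belongs to which heading.
headings <- toy %>% html_nodes("h3") %>% html_text()
titles   <- toy %>% html_nodes("li") %>% html_text()

length(headings)  # 2
length(titles)    # 3
```

Because the heading vector has one entry per section and the title vector has one entry per link, the grouping has to be recovered from the page structure rather than from the flat vectors.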

The page is structured oddly: there are tables nested inside the main table.

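What I found to work is iterating (with map_df()) over the cells of the parent table (identified by the class s4-wpcell-plain). Each cell contains another table, but rather than relying on html_table() we can simply extract the pieces we are after.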

library(tidyverse)
library(rvest)
#> Loading required package: xml2

r <- read_html("http://www.secnav.navy.mil/fmc/fmb/Pages/Fiscal-Year-2019.aspx") %>% 
  html_node("#sharePointMainContent>div>table") %>% 
  html_nodes(".s4-wpcell-plain") %>% 
  map_df(~{
    # within one cell: a heading plus the list items and links under it
    heading <- .x %>% html_nodes('h3') %>% html_text() %>% str_trim()
    titles <- .x %>% html_nodes('li') %>% html_text()
    links <- .x %>% html_nodes('a') %>% html_attr("href")
    # a length-1 heading is recycled across the rows of this cell;
    # tibble() replaces the deprecated data_frame()
    tibble(heading, titles, links)
  })
r
#> # A tibble: 21 x 3
#>    heading                        titles                 links            
#>    <chr>                          <chr>                  <chr>            
#>  1 DEPARTMENT OF THE NAVY SUMMARY FY 19 DON Press Brief  http://www.secna…
#>  2 DEPARTMENT OF THE NAVY SUMMARY Supporting Exhibits    http://www.secna…
#>  3 DEPARTMENT OF THE NAVY SUMMARY Budget Highlights Book http://www.secna…
#>  4 DEPARTMENT OF THE NAVY SUMMARY The Bottom Line        http://www.secna…
#>  5 DEPARTMENT OF THE NAVY SUMMARY Report to Congress on… http://www.secna…
#>  6 DEPARTMENT OF THE NAVY SUMMARY Ship Building Plan SE… http://www.secna…
#>  7 MILITARY PERSONNEL PROGRAMS    Military Personnel, N… http://www.secna…
#>  8 MILITARY PERSONNEL PROGRAMS    Military Personnel, M… http://www.secna…
#>  9 MILITARY PERSONNEL PROGRAMS    Reserve Personnel, Na… http://www.secna…
#> 10 MILITARY PERSONNEL PROGRAMS    Reserve Personnel, Ma… http://www.secna…
#> # ... with 11 more rows

Created on 2018-09-04 by the reprex package (v0.2.0).
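The per-cell iteration can be tried without network access on a small, made-up HTML fragment. In the sketch below, the `cell` class, section names, and file names are hypothetical. It also makes one change to the approach above, which is my assumption rather than part of the original answer: the link is taken from inside each list item with html_node(), which returns exactly one result per item, so titles and links stay aligned even if a list item has no link.

```r
library(tidyverse)
library(rvest)

# Hypothetical stand-in for the real page; the last item has no link.
toy <- read_html('
  <div id="main">
    <div class="cell"><h3> SECTION A </h3>
      <ul>
        <li><a href="a1.pdf">Doc A1</a></li>
        <li><a href="a2.pdf">Doc A2</a></li>
      </ul>
    </div>
    <div class="cell"><h3>SECTION B</h3>
      <ul>
        <li><a href="b1.pdf">Doc B1</a></li>
        <li>Doc B2 (no link)</li>
      </ul>
    </div>
  </div>')

result <- toy %>%
  html_nodes(".cell") %>%
  map_df(function(cell) {
    items <- html_nodes(cell, "li")
    tibble(
      # one heading per cell, recycled across that cell's rows
      heading = cell %>% html_node("h3") %>% html_text() %>% str_trim(),
      title   = items %>% html_text() %>% str_trim(),
      # html_node() gives one match per <li>; a missing <a> becomes NA
      link    = items %>% html_node("a") %>% html_attr("href")
    )
  })

result
```

Each row of `result` pairs a link with the heading of the cell it came from, and the list item without a link gets `NA` in the `link` column instead of shifting the later links up by one.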
