Looping through alphabetically ordered pages (rvest)



After spending a lot of time on this problem and reviewing the available answers, I'd like to go ahead and ask a new question to address my web-scraping problem with R and rvest. I've tried to lay the problem out fully to minimize follow-up questions.

Problem: I'm trying to extract author names from a conference web page. The authors are separated alphabetically by last name; as a result, I need a for loop that calls follow_link() 25 times to go to each page and extract the relevant author text.

Conference website: https://gsa.confex.com/gsa/2016AM/webprogram/authora.html

I've attempted two solutions in R using rvest, both with problems.

Solution 1 (follow_link() by link text)

lttrs <- LETTERS[seq(from = 1, to = 26)] # create character vector
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")
tempList <- list() # create list to store each page's author information
for (i in 1:length(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>% # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This code works... to a point. Here is the output. It successfully navigates through the letter pages until the H-to-I and L-to-M transitions, at which point it grabs the wrong page:

Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home

Solution 2 (CSS call to the links): Using the CSS selectors on the page, each lettered page is identified as "a:nth-child(1-26)". So I rebuilt the loop using a call to that CSS identifier:

tempList <- list()
for (i in 2:length(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = paste('a:nth-child(', i, ')', sep = '')) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This sort of works. Again, it runs into problems at certain transitions (see below):

Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html

Specifically, this approach misses B, C, and D, looping to the wrong pages at that step. I would greatly appreciate any insight or guidance on how to reconfigure my code above to correctly loop through all 26 alphabet pages.

Thanks so much!

Welcome to SO (and kudos on a solid first question).

You seem to be in luck: the site's robots.txt has quite a few entries but makes no attempt to restrict what you're doing.
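As for why your loops jump to the wrong pages: follow_link() with a string matches the first link whose text contains that string, so a single letter like "I" or "M" can match some other link on the page before the pagination link you intended. One way around that ambiguity is to skip link-following entirely and build the 26 page URLs yourself. A minimal sketch (assuming every letter page follows the authora.html ... authorz.html naming pattern, which your output above suggests):

library(rvest)

# Build the 26 letter-page URLs directly instead of matching link text
urls <- sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/author%s.html", letters)

tempList <- lapply(urls, function(u) {
  Sys.sleep(5) # crawl delay: don't hammer the site
  read_html(u) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
})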

We can pull all the hrefs out of the alphabet pagination links at the bottom of the page with html_nodes(pg, "a[href^='author']"). The following retrieves all paper links for all authors:

library(rvest)
library(tidyverse)

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{
    pb$tick()$print() # increment progress bar
    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay
    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          paper = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath = "../div[@class='papers']/a") %>%
            html_attr("href") %>%
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

author_papers
author_papers
## # A tibble: 34,983 x 3
##    author               paper  paper_url                                                    
##    <chr>                <chr>  <chr>                                                        
##  1 Aadahl, Kristopher   296-5  https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
##  2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
##  3 Abbey, Alyssa        54-4   https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
##  4 Abbott, Dallas H.    341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
##  5 Abbott Jr., David M. 38-6   https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
##  6 Abbott, Grant        58-7   https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
##  7 Abbott, Jared        29-10  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
##  8 Abbott, Jared        317-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
##  9 Abbott, Kathryn A.   187-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
## 10 Abbott, Lon D.       208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
## # ... with 34,973 more rows

I don't know what you need from the individual paper pages, so that part is left to you.
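If, for instance, all you wanted from a paper page were something generic like its page title, a hypothetical one-off fetch could look like this (the node extracted here is an assumption for illustration; swap in whatever you actually need):

# Hypothetical example: fetch the first paper page and grab its <title>
paper_pg <- read_html(author_papers$paper_url[1])
html_text(html_node(paper_pg, "title"), trim = TRUE)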

You also don't have to wait the ~3m since the author_papers data frame is in this RDS file: https://rud.is/dl/author-papers.rds, which you can read with:

readRDS(url("https://rud.is/dl/author-papers.rds"))

If you do plan on scraping all 34,983 papers, then please continue to heed "don't be rude" and use a crawl delay (ref: https://rud.is/b/2017/07/28/analyzing-wait-delay-settings-in-common-crawl-robots-txt-data-with-r/).

UPDATE

html_nodes(pg, "a[href^='author']") %>%
  html_attr("href") %>%
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>%
  { pb <<- progress_estimated(length(.)) ; . } %>% # we'll use a progress bar as this will take ~3m
  map_df(~{
    pb$tick()$print() # increment progress bar
    Sys.sleep(5) # PLEASE leave this in. It's rude to hammer a site without a crawl delay
    read_html(.x) %>%
      html_nodes("div.item > div.author") %>%
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath = "../div[@class='papers']") %>%
            html_text(trim = TRUE) %>% # retrieve the text of all the "papers"
            paste0(collapse = " ") %>% # just in case there are multiple nodes, flatten them into one
            grepl("*", ., fixed = TRUE) # make it TRUE if we find the "*"
        )
      })
  }) -> author_with_presenter_status

author_with_presenter_status
author_with_presenter_status
## # A tibble: 22,545 x 2
##    author               is_presenting
##    <chr>                <lgl>        
##  1 Aadahl, Kristopher   FALSE        
##  2 Aanderud, Zachary T. FALSE        
##  3 Abbey, Alyssa        TRUE         
##  4 Abbott, Dallas H.    FALSE        
##  5 Abbott Jr., David M. TRUE         
##  6 Abbott, Grant        FALSE        
##  7 Abbott, Jared        FALSE        
##  8 Abbott, Kathryn A.   FALSE        
##  9 Abbott, Lon D.       FALSE        
## 10 Abbott, Mark B.      FALSE        
## # ... with 22,535 more rows

You can also retrieve it with:

readRDS(url("https://rud.is/dl/author-presenter.rds"))

LATEST UPDATE