r - Scraping links from a web page with rvest



I am using rvest to scrape some links from the magazine The Hustle. I used this code:

library(rvest)
page <- read_html("https://thehustle.co/daily/page/33/") %>% 
  html_nodes(".daily-article-title") %>% 
  html_attr('href')

However, this returns a vector of 30 NAs. I used SelectorGadget to find the class, so I'm not sure what is going wrong here.

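The link sits above the '.daily-article-title' class in the markup: the element carrying that class has no href attribute of its own, so html_attr('href') returns NA for each of the 30 matched nodes. Here is how to get the titles and the corresponding links.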

library(rvest)
webpage <- read_html("https://thehustle.co/daily/page/33/")
# Titles: the text of each <h3> with class daily-article-title
webpage %>%
  html_nodes("h3.daily-article-title") %>% 
  html_text() -> title
title
# [1] "\nApple buys itself a $400m Christmas present\n"          
# [2] "\nSan Francisco wages war on robots\n"                    
# [3] "\nThe US could lose its greatest export\n"                
# [4] "\n\"Mom, where do podcasts come from?\"\n"                
# [5] "\nTencent Music to team up with Spotify?\n"               
# [6] "\nFirst rule of the Farmers Business Network?\n"          
# [7] "\nSpiegel goes HAM on social media\n"                     
# [8] "\nBanks won't take weed companies’ cash\n"                
# [9] "\nThe Koch bros just took a $650m stake in Time\n"        
#[10] "\n4 mins to smarter Monday smalltalk\n"     
#...
#...
           
# Links: the href of each <a> inside the wrapper div
webpage %>%
  html_nodes("[class='col-md-12 daily-wrap clearfix'] a") %>%
  html_attr('href') -> link
link
# [1] "https://thehustle.co/apple-christmas-present"                          
# [2] "https://thehustle.co/war-on-robots"                                    
# [3] "https://thehustle.co/big-data-trade-nafta-daily"                       
# [4] "https://thehustle.co/apple-podcast-market"                             
# [5] "https://thehustle.co/tencent-spotify-truce"                            
# [6] "https://thehustle.co/first-rule-of-farmers"                            
# [7] "https://thehustle.co/snap-anti-facebook"                               
# [8] "https://thehustle.co/weed-banking"                                     
# [9] "https://thehustle.co/pepshi-bros"                                      
#[10] "https://thehustle.co/rundown"    
#...
#...                              
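
The two vectors above can be paired into one table. The following is a minimal sketch rather than a definitive implementation. It runs on a small inline HTML snippet instead of the live page, and that snippet is an assumption about the page structure: each title <h3> is wrapped by the <a> that carries the href. The example.com URLs, the variable names title_nodes, na_href and articles, and the use of trimws() to strip the surrounding newlines are all illustrative choices, not part of the original answer.

```r
library(rvest)

# Hypothetical markup, assumed for illustration only: the <a> that holds the
# href wraps the <h3 class="daily-article-title">.
html <- '
<div class="col-md-12 daily-wrap clearfix">
  <a href="https://example.com/first"><h3 class="daily-article-title">
First title
</h3></a>
  <a href="https://example.com/second"><h3 class="daily-article-title">
Second title
</h3></a>
</div>'
doc <- read_html(html)

title_nodes <- html_nodes(doc, "h3.daily-article-title")

# The <h3> nodes have no href attribute, so this is NA for every node,
# which mirrors the vector of NAs described in the question.
na_href <- html_attr(title_nodes, "href")

# Strip the leading and trailing newlines from the title text.
title <- trimws(html_text(title_nodes))

# Take the href from the <a> elements inside the wrapper div instead.
link <- doc %>%
  html_nodes("[class='col-md-12 daily-wrap clearfix'] a") %>%
  html_attr("href")

# Pair each title with its link; this assumes both vectors have the same
# length and order, which holds for the snippet above.
articles <- data.frame(title = title, link = link, stringsAsFactors = FALSE)
articles
```

On the real page it is worth checking length(title) == length(link) before building the data frame, since any extra <a> inside the wrapper would shift the pairing.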
