R - Web scraping an HTML page with rvest



I am trying to scrape bookmaker odds from this page:

https://www.interwetten.com/en/sportsbook/top-leagues?topLinkId=1

So far I have written the following code:

library(rvest)
interwetten <- read_html("https://www.interwetten.com/en/sportsbook/top-leagues?topLinkId=1")
bundesliga <- html_nodes(interwetten, xpath = '//*[@id="TBL_Content_1019"]')
bundesliga_teams <- html_nodes(bundesliga, "span")

The output I get is:

[1] <span id="ctl00_cphMain_UCOffer_LeagueList_rptLeague_ctl00_ucBettingContainer_lblClose" clas ...
[2] <span itemscope="itemscope" itemprop="location" itemtype="http://schema.org/Place"><meta ite ...
[3] <span itemprop="name">VfB Stuttgart</span>
[4] <span>X</span>

Now I want to extract the team name from each <span itemprop="name"></span>, but I don't know how. I tried working with the nodes or the attributes, but it didn't work.

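You can make the XPath selector more specific, so that it matches only the spans with itemprop="name", and then call html_text() on the result, e.g.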

library(rvest)
interwetten <- 'https://www.interwetten.com/en/sportsbook/top-leagues?topLinkId=1' %>% 
    read_html() 
teams <- interwetten %>% 
    html_nodes(xpath = '//*[@id="TBL_Content_1019"]//span[@itemprop="name"]') %>% 
    html_text()
teams
#>  [1] "VfB Stuttgart"   "1. FC Cologne"   "Mainz 05"       
#>  [4] "Hamburger SV"    "Hertha BSC"      "Schalke 04"     
#>  [7] "Hannover 96"     "Frankfurt"       "Hoffenheim"     
#> [10] "Augsburg"        "Bayern Munich"   "Freiburg"       
#> [13] "Dortmund"        "RB Leipzig"      "Leverkusen"     
#> [16] "Wolfsburg"       "Werder Bremen"   "Monchengladbach"
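
Since the live page changes over time, the same idea can be tried offline. The following is a minimal sketch, assuming a small hand-written HTML snippet that only imitates the structure shown in the question (the snippet and the variable names page, teams_xpath, teams_css and attrs are made up for illustration). It also shows what appears to be an equivalent CSS selector, and how html_attr() differs from html_text().

```r
library(rvest)

# Hypothetical, simplified stand-in for the real page markup.
page <- read_html('
<table id="TBL_Content_1019">
  <tr>
    <td><span itemprop="name">VfB Stuttgart</span></td>
    <td><span>X</span></td>
    <td><span itemprop="name">1. FC Cologne</span></td>
  </tr>
</table>')

# XPath: every span with itemprop="name" anywhere below the table.
name_xpath <- '//*[@id="TBL_Content_1019"]//span[@itemprop="name"]'

# html_text() returns the text between the opening and closing tags.
teams_xpath <- page %>%
  html_nodes(xpath = name_xpath) %>%
  html_text()

# The same selection written as a CSS selector.
teams_css <- page %>%
  html_nodes('#TBL_Content_1019 span[itemprop="name"]') %>%
  html_text()

# html_attr() returns the value of an attribute, not the node text.
attrs <- page %>%
  html_nodes(xpath = name_xpath) %>%
  html_attr("itemprop")

teams_xpath
```

The plain <span>X</span> node is skipped because it has no itemprop attribute, which is why narrowing the selector is enough here and no filtering is needed afterwards.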
