I'm working on a web-scraping exercise, and I want to get the table below from this url:
https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory
"COVID-19 cases, deaths, and rates by location, as of 7 April 2022[5]"
I right-clicked in my browser and inspected the page, trying to find the table's id/node to put in place of the `"??"` in the code below, but I can't find that node.
library(tidyverse)
library(rvest)
# get the data
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
html_data <- read_html(url)
html_data %>%
  html_node("??") %>% # how do I get the node containing the table
  html_table() %>%
  as_tibble()
Thanks!
I'd suggest using a more stable, faster, and more descriptive css selector list rather than a long, brittle xpath. There is a combination of a specific parent id (generally the fastest way to match) and a child table class (second fastest) you can use:
library(magrittr)
library(rvest)
df <- read_html('https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory') %>%
  html_element('#covid-19-cases-deaths-and-rates-by-location .wikitable') %>%
  html_table()
Recommended reading:
https://developer.mozilla.org/en-US/docs/Web/CSS/CSS_Selectors
Practice:
https://flukeout.github.io/
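The "parent id + descendant class" idea above can also be tried offline. Here is a minimal sketch using rvest's `minimal_html()` on a hypothetical inline snippet that mimics the page structure (the div wrapper and the single-row table are assumptions for illustration; only the id and the `wikitable` class are taken from the selector above):

```r
library(rvest) # also re-exports the magrittr pipe

# Hypothetical snippet: a div carrying the parent id, wrapping a table
# with the wikitable class, as the CSS selector expects
page <- minimal_html('
  <div id="covid-19-cases-deaths-and-rates-by-location">
    <table class="wikitable">
      <tr><th>Country</th><th>Deaths</th></tr>
      <tr><td>Peru</td><td>212,396</td></tr>
    </table>
  </div>')

# The same two-part selector: fast id match first, then the table class
page %>%
  html_element('#covid-19-cases-deaths-and-rates-by-location .wikitable') %>%
  html_table()
```

Because the id narrows the search to one subtree before the class is matched, this tends to be both faster and more robust to page edits than an absolute xpath.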
Use your browser to get the xpath of the table, and use that in place of `"??"`.
suppressPackageStartupMessages({
library(httr)
library(rvest)
library(dplyr)
})
url <- "https://en.wikipedia.org/wiki/COVID-19_pandemic_by_country_and_territory"
xp <- "/html/body/div[3]/div[3]/div[5]/div[1]/div[15]/div[5]/table"
html_data <- read_html(url)
html_data %>%
  html_elements(xpath = xp) %>% # select the table node via its xpath
  html_table() %>%
  .[[1]] %>%
  select(-1)
#> # A tibble: 218 x 4
#> Country `Deaths / million` Deaths Cases
#> <chr> <chr> <chr> <chr>
#> 1 World[a] 783 6,166,510 495,130,920
#> 2 Peru 6,366 212,396 3,549,511
#> 3 Bulgaria 5,314 36,655 1,143,424
#> 4 Bosnia and Herzegovina 4,819 15,728 375,948
#> 5 Hungary 4,738 45,647 1,863,039
#> 6 North Macedonia 4,433 9,234 307,142
#> 7 Montenegro 4,308 2,706 233,523
#> 8 Georgia 4,212 16,765 1,650,384
#> 9 Croatia 3,833 15,646 1,105,315
#> 10 Czech Republic 3,712 39,816 3,850,902
#> # ... with 208 more rows
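Note that the counts come back as character columns because of the thousands separators. A small follow-up sketch (assuming `readr` is installed, and using a toy tibble in the same format as the scraped output) converts them with `readr::parse_number()`:

```r
library(dplyr)

# Toy tibble mimicking the character columns of the scraped table
tbl <- tibble::tibble(
  Country = c("Peru", "Bulgaria"),
  Deaths  = c("212,396", "36,655"),
  Cases   = c("3,549,511", "1,143,424"))

# parse_number() drops grouping marks and returns doubles
tbl %>%
  mutate(across(-Country, readr::parse_number))
```

After this, the `Deaths` and `Cases` columns are numeric and can be sorted or plotted directly.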
Created on 2022-04-08 by the reprex package (v2.0.1)