R - Using rvest to extract the URL of each city in a list



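I've been working with the rvest package and have a question about extracting URLs from a list. My goal is to produce a df with the following columns: country, city, and the city's URL. I already have a df of the countries and a list of the cities for each country.

My question is: how do I reference each city so that I can get its URL? I tried to reference the href inside the td elements of the "wikitable sortable jquery-tablesorter" table. But when I run links = webpage %>% html_node("href") %>% html_text(), I only get the main URL.

Thanks for any suggestions!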

library(tidyverse)
library(rvest)
# Get URL
url = "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"
# Read the HTML code from the website
page = read_html(url)
# Get name of the countries
countries = page %>% html_nodes(".mw-headline") %>% html_text()
# Remove the last two items, which are not countries
countries = as_tibble(countries) %>%
  slice(1:(n()-2))
#Add row number to each Country to left_join later
countries = rowid_to_column(countries, "column_label")
# Get cities for that country
# Still working on this since it includes the first table and I get blanks when I filter the html_nodes(".jquery-tablesorter td")
tables = html_nodes(page, "table")
tables = lapply(tables, html_table)
# Remove the first element, which is not a city table (only on the first page)
tables = tables[-1]
#---WIP
# Get links for the cities, currently picks the main domain instead of the city
# Can I add a clause before the html node to indicate I want the href from "wikitable sortable jquery-tablesorter"?
links = page %>% html_attr("href") %>% html_text()
#---
# Remove the Province and Population columns, keeping only City
tables = lapply(tables, "[", -c(2, 3))
#Standardize City as the column
tables = map(tables, set_names, "City")
# Flatten List
all <- bind_rows(tables, .id = "column_label") %>%
  mutate(column_label = as.integer(column_label)) %>%
  left_join(countries, by = "column_label")

Here is a fully reproducible example that gives you a table of cities with their full URLs:

library(tidyverse)
library(rvest)
"https://en.wikipedia.org/wiki/" %>%
  paste0('List_of_towns_and_cities_with_100,000_or_more_inhabitants/') %>%
  paste0('country:_A-B') %>%
  read_html() %>%
  html_nodes(xpath = "//table/tbody/tr") %>%
  lapply(function(x) {
    # First link in a cell of this row; its attributes are NA if the row has none
    node <- xml2::xml_find_first(x, 'td/a')
    data.frame(city = html_attr(node, 'title'),
               # hrefs already begin with "/wiki/", so prepend only the domain
               url = paste0("https://en.wikipedia.org",
                            html_attr(node, 'href')))}) %>%
  bind_rows() %>%
  # Drop rows without a city (header rows and rows with no link)
  remove_missing(na.rm = TRUE) %>%
  as_tibble()
#> # A tibble: 534 x 2
#>    city           url
#>    <chr>          <chr>
#>  1 Ghazni         https://en.wikipedia.org/wiki/Ghazni
#>  2 Herat          https://en.wikipedia.org/wiki/Herat
#>  3 Jalalabad      https://en.wikipedia.org/wiki/Jalalabad
#>  4 Kabul          https://en.wikipedia.org/wiki/Kabul
#>  5 Kandahar       https://en.wikipedia.org/wiki/Kandahar
#>  6 Khost          https://en.wikipedia.org/wiki/Khost
#>  7 Kunduz         https://en.wikipedia.org/wiki/Kunduz
#>  8 Lashkargah     https://en.wikipedia.org/wiki/Lashkargah
#>  9 Mazar-i-Sharif https://en.wikipedia.org/wiki/Mazar-i-Sharif
#> 10 Mihtarlam      https://en.wikipedia.org/wiki/Mihtarlam
#> # ... with 524 more rows

Created on 2023-01-06 with reprex v2.0.2
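To see why the original attempt returned nothing useful: href is an attribute of the a elements, not an element itself, so it has to be read with html_attr() from a nodes that were selected first. The jquery-tablesorter class is most likely added by JavaScript in the browser, so it may not be present in the HTML that read_html() downloads; selecting on table.wikitable is probably safer. Below is a minimal offline sketch, assuming a small made-up HTML fragment and a placeholder base_url (both hypothetical, not the real page), that also resolves the relative links with xml2::url_absolute() instead of pasting strings by hand.

```r
library(rvest)

# Hypothetical inline HTML standing in for one table on the Wikipedia page
page <- read_html('
  <table class="wikitable sortable">
    <tr><th>City</th><th>Province</th></tr>
    <tr><td><a href="/wiki/Ghazni" title="Ghazni">Ghazni</a></td>
        <td><a href="/wiki/Ghazni_Province">Ghazni</a></td></tr>
    <tr><td><a href="/wiki/Herat" title="Herat">Herat</a></td>
        <td>Herat</td></tr>
  </table>
')

# Placeholder URL of the page the fragment is assumed to come from
base_url <- "https://en.wikipedia.org/wiki/Example_page"

# Select only the links in the first cell of each row of a wikitable
city_links <- html_nodes(page, "table.wikitable tr td:first-child a")

cities <- data.frame(
  city = html_text(city_links),
  # Resolve relative hrefs such as "/wiki/Ghazni" against the page URL
  url = xml2::url_absolute(html_attr(city_links, "href"), base_url),
  stringsAsFactors = FALSE
)
print(cities)
```

The td:first-child part keeps the province links in the second column out of the result.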

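One way to achieve the desired result could look like this. I took a different approach, using a small custom function that scrapes the table rows to get the desired content: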

library(tidyverse)
library(rvest)
# Get a dataframe of city names and urls for one table
get_cities <- function(x) {
  x %>%
    html_nodes("tr") %>%
    # Drop the header row
    .[-1] %>%
    # Get first column/cell containing city
    html_node("td a") %>%
    map_dfr(function(x) {
      data.frame(
        city = html_text(x),
        url = paste0("https://en.wikipedia.org",
                     html_attr(x, 'href'))
      )
    })
}
url <- "https://en.wikipedia.org/wiki/List_of_towns_and_cities_with_100,000_or_more_inhabitants/country:_A-B"
# Read the HTML code from the website
webpage <- read_html(url)
# Get name of the countries
countries <- webpage %>%
  html_nodes(".mw-headline") %>%
  html_text()
countries <- countries[!grepl("(See also|References)", countries)]
# Get table nodes
tables <- webpage %>%
  html_nodes("table.wikitable.sortable")
names(tables) <- countries
res <- map_dfr(tables, get_cities, .id = "country")
head(res)
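#>       country      city                                     url
#> 1 Afghanistan    Ghazni    https://en.wikipedia.org/wiki/Ghazni
#> 2 Afghanistan     Herat     https://en.wikipedia.org/wiki/Herat
#> 3 Afghanistan Jalalabad https://en.wikipedia.org/wiki/Jalalabad
#> 4 Afghanistan     Kabul     https://en.wikipedia.org/wiki/Kabul
#> 5 Afghanistan  Kandahar  https://en.wikipedia.org/wiki/Kandahar
#> 6 Afghanistan     Khost     https://en.wikipedia.org/wiki/Khost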
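Assigning names(tables) <- countries relies on the page having exactly one table per heading, in the same order. As a hedged alternative, the sketch below looks up the nearest preceding heading for each table with XPath, so the pairing does not depend on position. It runs on a made-up two-country HTML fragment (hypothetical), and the .mw-headline class is simply carried over from the code above; whether the live page still uses that markup is not checked here.

```r
library(rvest)

# Hypothetical fragment mimicking the heading + table layout of the page
page <- read_html('
  <h2><span class="mw-headline">Afghanistan</span></h2>
  <table class="wikitable sortable">
    <tr><th>City</th></tr>
    <tr><td><a href="/wiki/Kabul">Kabul</a></td></tr>
  </table>
  <h2><span class="mw-headline">Albania</span></h2>
  <table class="wikitable sortable">
    <tr><th>City</th></tr>
    <tr><td><a href="/wiki/Tirana">Tirana</a></td></tr>
  </table>
')

tables <- html_nodes(page, "table.wikitable.sortable")

rows <- lapply(tables, function(tbl) {
  # Nearest heading that appears before this table in document order
  heading <- xml2::xml_find_first(
    tbl, "preceding::span[@class='mw-headline'][1]"
  )
  # Links in the first cell of each row of this table
  links <- html_nodes(tbl, "td:first-child a")
  data.frame(
    country = rep(html_text(heading), length(links)),
    city = html_text(links),
    url = paste0("https://en.wikipedia.org", html_attr(links, "href")),
    stringsAsFactors = FALSE
  )
})

res <- do.call(rbind, rows)
print(res)
```

If a heading had no table, or a table had no heading, the positional approach would silently shift country names, whereas this lookup would leave the other tables correctly labelled.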
