R: Problem with selectors when web scraping multiple pages



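I am trying to scrape wine ratings (points) across multiple pages, but I am having trouble with the selector (I tried SelectorGadget without success).

So far I have only managed to scrape a single page: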

library(rvest)
points <- read_html("https://www.winemag.com/buying-guide/lagar-de-bezana-2014-aluvion-ensamblaje-red-cachapoal-valley/")
points %>%
  html_node(".rating") %>%
  html_text()
[1] "93points"

For multiple pages, the results are not the actual values (every page returns the same ratings):

library(rvest)
points <- lapply(paste0('https://www.winemag.com/?s=chile&search_type=all', 1:5),
  function(url){
    url %>% read_html() %>%
      html_nodes(".rating") %>%
      html_text()
  })
points
[[1]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[2]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[3]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[4]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"
[[5]]
[1] "93 Points" "92 Points" "92 Points" "92 Points" "92 Points" "92 Points"

This solution seems to work. I changed how the URLs are built, so the page number is passed in the page parameter:

library(rvest)
points <- lapply(paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all3', 1:5),
  function(url){
    url %>% read_html() %>%
      html_nodes(".rating") %>%
      html_text()
  })
points
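
A possible explanation for the repeated results in the question, offered as a guess rather than something confirmed by the site: paste0() glued the counter onto the search_type value, so no page parameter was ever sent and the site probably served the same first page each time. Printing the generated URLs makes the difference visible without any network access. The names old_urls and new_urls are only for illustration, and the trailing counter after search_type=all3 is copied unchanged from the code above.

```r
# URLs as built in the question: the counter lands on the search_type value
old_urls <- paste0('https://www.winemag.com/?s=chile&search_type=all', 1:5)

# URLs as built in the answer: the counter is passed through an explicit page parameter
new_urls <- paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5,
                   '&search_type=all3', 1:5)

# Inspect one URL from each set
print(old_urls[2])
print(new_urls[2])

# Check which set of URLs actually carries a page= parameter
print(grepl('page=', old_urls, fixed = TRUE))
print(grepl('page=', new_urls, fixed = TRUE))
```

Inspecting the URL vector before calling read_html() is a cheap way to catch this kind of mistake.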

Personally I would write it like this, although that is certainly a matter of preference:

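library(rvest)
library(dplyr)  # tibble(), rowwise(), mutate()
library(tidyr)  # unnest()

# One row per search-results URL
df <- tibble(url = paste0('https://www.winemag.com/?s=chile&drink_type=wine&page=', 1:5, '&search_type=all3', 1:5)) %>%
  rowwise() %>%
  mutate(
    # Wrap the ratings in a list so each row can hold every rating from its page
    rating = read_html(url) %>%
      html_nodes(".rating") %>%
      html_text() %>%
      list()
  ) %>%
  # Expand the list column to one row per rating
  unnest(cols = c(rating))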
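
If the goal is to work with the points as numbers, the scraped strings still need converting. A minimal sketch using base R only, assuming the text always contains exactly one run of digits as in the outputs shown earlier in this thread; the sample vector and the name rating_num are made up for illustration:

```r
# Sample strings in the two shapes seen in this thread's output
rating <- c("93 Points", "92 Points", "93points")

# Drop every non-digit character, then convert what is left to integer
rating_num <- as.integer(gsub("[^0-9]", "", rating))

print(rating_num)
```

The same expression could be applied to the rating column inside mutate() once the data frame has been unnested.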
