r - Using rvest to extract information from a website with collapsible content



The website: https://www.moe.gov.sg/schoolfinder/schooldetail?schoolname=ZHONGHUA-SECONDARY-SCHOOL

I only want to extract the information under "DSA talent areas offered in 2021".

However, when I use SelectorGadget, I get the path .is--open:nth-child(4) .moe-collapsible__content

dsa <- html_node(listpage,".is--open:nth-child(4) .moe-collapsible__content") %>% html_text() %>% unlist()
dsa

The output is NA.

Is there any way to get the information out of collapsible content?

One approach is:

library(rvest)
library(dplyr)
library(stringr)
'https://www.moe.gov.sg/schoolfinder/schooldetail?schoolname=ZHONGHUA-SECONDARY-SCHOOL' %>% 
read_html() %>% html_nodes('.moe-collapsible__content') %>% html_nodes('.moe-list') %>% html_text() %>% nth(3) %>% str_split('\n')
[[1]]
[1] "Leadership and Character (Girls and Boys)\r"
[2] "                                        Chinese Orchestra (Girls and Boys)\r"
[3] "                                        Choir (Girls and Boys)\r"
[4] "                                        Concert Band (Girls and Boys)\r"
[5] "                                        Guzheng Ensemble (Girls and Boys)\r"
[6] "                                        Badminton (Girls)\r"
[7] "                                        Basketball (Girls)\r"
[8] "                                        Table Tennis (Boys)\r"
[9] "                                        Volleyball (Boys)"
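
The entries above still carry leading indentation and a trailing carriage return. A minimal sketch of one way to tidy them, assuming the text is separated by "\r\n" as the output suggests; the `raw` string here is a shortened, hypothetical stand-in for the scraped text, not the live page:

```r
library(stringr)

# Hypothetical stand-in for the scraped text: items separated by CRLF plus indentation
raw <- "Leadership and Character (Girls and Boys)\r\n          Chinese Orchestra (Girls and Boys)\r\n          Choir (Girls and Boys)"

# Split on LF with an optional preceding CR, then strip surrounding whitespace
areas <- str_trim(str_split(raw, "\\r?\\n")[[1]])
areas
```

In the pipeline above, the same two steps could replace the final `str_split('\n')` call.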

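You can be more precise by combining :contains with the class to target the right parent div, then using a descendant combinator to move to the li elements inside it. By matching on a partial string, you may get some future-proofing for 2022.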

library(magrittr)
library(rvest)
read_html("https://www.moe.gov.sg/schoolfinder/schooldetail?schoolname=ZHONGHUA-SECONDARY-SCHOOL") %>%
html_elements('.moe-collapsible:contains("DSA talent areas") li') %>% html_text()
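
To try the selector without hitting the live site, here is a small offline sketch. The markup is hypothetical, only loosely modeled on the class names seen in the question, so the real page may be structured differently. It also illustrates one plausible reason for the NA in the question: if a state class such as `is--open` is only added in the browser when a section is expanded (an assumption, not confirmed by the source), it is absent from the HTML that rvest parses and the selector matches nothing.

```r
library(rvest)

# Hypothetical markup; class names borrowed from the question, structure assumed
page <- minimal_html('
  <div class="moe-collapsible">
    <h3>Subjects offered</h3>
    <div class="moe-collapsible__content"><ul class="moe-list"><li>Art</li></ul></div>
  </div>
  <div class="moe-collapsible">
    <h3>DSA talent areas offered in 2021</h3>
    <div class="moe-collapsible__content"><ul class="moe-list"><li>Badminton (Girls)</li><li>Choir (Girls and Boys)</li></ul></div>
  </div>
')

# No element carries the is--open class here, so nothing matches and html_text() gives NA
missing <- page %>% html_element(".is--open .moe-collapsible__content") %>% html_text()

# :contains selects the section by part of its heading text, independent of position
dsa <- page %>%
  html_elements('.moe-collapsible:contains("DSA talent areas") li') %>%
  html_text()
dsa
```

Because the match is on the partial string "DSA talent areas", the selector does not depend on the section's position or on the year in the heading.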
