Scraping multiple articles with purrr::map instead of a for loop in R



Hello everyone.

I am trying to use R to get the article titles from this website (https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1).

I ran the following code:

### read HTML ###
html_narou <- rvest::read_html("https://yomou.syosetu.com/search.php?&type=er&order_former=search&order=new&notnizi=1&p=1",
                               encoding = "UTF-8")
### create the common parts of the CSS selector ###
base_css_former <- "#main_search > div:nth-child("
base_css_latter <- ") > div > a"
### create NULL objects ###
art_css <- NULL
narou_titles <- NULL
### extract the title data and store it in the NULL object ###
#### The article titles do not exist in "#main_search > div:nth-child(1~4) > div > a", so i in the loop starts from five ####
for (i in 5:24) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)

  narou_title <- rvest::html_element(x = html_narou,
                                     css = art_css) %>%
    rvest::html_text()
  narou_titles <- base::append(narou_titles, narou_title)
}

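However, doing this with a for loop in R takes a long time, so I would like to use the "map" function from "purrr" instead. But I am not familiar with purrr::map, and the process is somewhat complicated. How can I replace the for loop with map?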

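The real problem is that you are growing the narou_titles vector on every iteration, which is notoriously slow in R. Instead, you should preallocate the vector at its final length and then assign elements by index. purrr does this behind the scenes, which can make it look faster, but you can do the same thing without purrr.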
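As a rough illustration of the two patterns, here is a minimal base R sketch. The helper make_label() is a hypothetical stand-in for the per-element work and is not part of the scraping code; the index range 5:24 is taken from the question.

```r
# Hypothetical stand-in for the per-element work (not the scraping code).
make_label <- function(i) paste0("title-", i)

idx <- 5:24

# Pattern 1: grow the vector on each iteration.
# Each append() builds a new, longer vector.
grown <- NULL
for (i in idx) {
  grown <- append(grown, make_label(i))
}

# Pattern 2: preallocate the final length, then assign by position.
# seq_along(idx) gives positions 1..20 regardless of where idx starts.
prealloc <- character(length(idx))
for (k in seq_along(idx)) {
  prealloc[[k]] <- make_label(idx[[k]])
}

# Both patterns produce the same result; only the allocation strategy differs.
stopifnot(identical(grown, prealloc))
```

With only 20 elements the difference is unlikely to be noticeable; the cost of growing tends to matter as the vector gets long.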

With your for loop:

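library(rvest)
# Preallocate a character vector with one slot per title (20 in total).
narou_titles <- vector("character", 20)
for (i in 5:24) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)

  # i runs from 5 to 24, so shift by 4 to fill positions 1 to 20.
  narou_titles[[i - 4]] <- html_element(
    x = html_narou,
    css = art_css
  ) %>%
    html_text()
}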

With purrr::map_chr():

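library(rvest)
library(purrr)
# Build the selector for the i-th result and return its title text.
get_title <- function(i) {
  art_css <- paste0(base_css_former, as.character(i), base_css_latter)
  html_element(
    x = html_narou,
    css = art_css
  ) %>%
    html_text()
}
# map_chr() returns a character vector with one title per index.
narou_titles <- map_chr(5:24, get_title)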
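
If you want to try the map_chr() approach without requesting the site, here is a small offline sketch. The HTML fragment, html_demo, and get_title_demo() are all made up for illustration; the fragment only imitates the layout implied by the selector in the question (four non-title children, then the title links) and is not the real page. It assumes rvest 1.0 or later for minimal_html() and html_element().

```r
library(rvest)
library(purrr)

# Hypothetical miniature page: four filler children, then three title entries.
html_demo <- minimal_html('
  <div id="main_search">
    <div>filler 1</div>
    <div>filler 2</div>
    <div>filler 3</div>
    <div>filler 4</div>
    <div><div><a href="#">Title A</a></div></div>
    <div><div><a href="#">Title B</a></div></div>
    <div><div><a href="#">Title C</a></div></div>
  </div>
')

# Same idea as get_title() above, but reading from the demo page.
get_title_demo <- function(i) {
  css <- paste0("#main_search > div:nth-child(", i, ") > div > a")
  html_demo %>%
    html_element(css = css) %>%
    html_text()
}

# Children 5 to 7 hold the titles in this demo fragment.
titles <- map_chr(5:7, get_title_demo)
print(titles)
```

map_chr() checks that every call returns a single string, so a selector that unexpectedly returns something else fails loudly instead of silently producing a list.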
