R - Fixing an error in a for loop that extracts product reviews from different URLs



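I am trying to extract the reviews of a product on Amazon. The reviews are spread over several pages whose URLs differ only in the page number. Running the script below manually works, but I have to change the page number in the URL and the name of the resulting object by hand, and rerun it for every page.

Doing this for almost 70 pages is tedious, so I tried to write a for loop that does the same thing. My attempt is below, but it gives me an error.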

MANUAL 
```
library(tidyr)
library(rvest)
url_reviews <- "https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_16?ie=UTF8&reviewerType=all_reviews&pageNumber=16"
doc <- read_html(url_reviews) # Assign results to `doc`
# Review Title
doc %>% 
html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
html_text() -> review_title
# Review Text
doc %>% 
html_nodes("[class='a-size-base review-text review-text-content']") %>%
html_text() -> review_text
# Number of stars in review
doc %>%
html_nodes("[data-hook='review-star-rating']") %>%
html_text() -> review_star
# Combine into a data frame
page_16 <- data.frame(review_title,
                      review_text,
                      review_star,
                      page = 16)
```

FOR LOOP
```
range <- 12:82
url_max <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", range ,"?ie=UTF8&reviewerType=all_reviews&pageNumber=",range)


for (i in 1:length(url_max)) {

doc <- read_html(url_max[i]) # Assign results to `doc`

# Review Title
doc %>% 
html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
html_text() -> review_title

# Review Text
doc %>% 
html_nodes("[class='a-size-base review-text review-text-content']") %>%
html_text() -> review_text

# Number of stars in review
doc %>%
html_nodes("[data-hook='review-star-rating']") %>%
html_text() -> review_star


paste0("page_", range)<-tibble(review_title,
review_text,
review_star,
page = paste0("a", i)) 
                 
}
```
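For reference, the assignment at the end of the loop is the part R rejects: the left-hand side of `<-` has to be a variable name or an element of an existing object, not the character vector returned by `paste0()`. A minimal sketch of one common way around this is to store each page in a named list. It assumes a hypothetical stand-in function `scrape_page()` in place of the `read_html()` code so that it runs offline, and the dummy titles are placeholders, not real data:

```r
# Hypothetical stand-in for the scraping code in the loop body;
# it returns a small dummy data frame instead of reading a real page.
scrape_page <- function(page) {
  data.frame(
    review_title = paste("title", page),
    page = page,
    stringsAsFactors = FALSE
  )
}

page_range <- 12:14  # shortened range for illustration

# Pre-allocate a list with one named slot per page
pages <- vector("list", length(page_range))
names(pages) <- paste0("page_", page_range)

for (i in seq_along(page_range)) {
  # Assign into a list element instead of into paste0(...)
  pages[[i]] <- scrape_page(page_range[i])
}

print(names(pages))
```

A single page can then be looked up by name, for example `pages[["page_13"]]`. `assign()` could also create separate variables, but a list is usually easier to combine afterwards.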

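Here is an alternative that defines a function and then uses lapply() to run it on each page in turn.

It may be helpful if you need to repeat this for different products. The function takes two arguments: the first, i, is the page number, and the second, product, is the product whose reviews you are collecting. The function builds the URL by inserting the corresponding page number.

I use lapply() here, but the function below could also be plugged into the map_df() call in Ronak's answer (which may be faster than binding the rows afterwards).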

```
library(dplyr)
library(rvest)
library(stringr)

retrieve_reviews <- function(i, product) {
  # Build the URL for page i of the given product
  urlstr <- "https://www.amazon.it/product-reviews/${product}/ref=cm_cr_getr_d_paging_btm_next_${i}?ie=UTF8&reviewerType=all_reviews&pageNumber=${i}"
  url <- str_interp(urlstr, list(product = product, i = i))
  doc <- read_html(url) # Assign results to `doc`

  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  return(tibble(
    title = review_title,
    text = review_text,
    star = review_star,
    page = paste0("a", i)
  ))
}

range <- 12:82
product <- "B07WTHVQZH"
reviews <- lapply(range, retrieve_reviews, product) %>%
  bind_rows()
```
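The URL-building step can be checked on its own, without any network access. Below is a minimal sketch that uses base R's `sprintf()` in place of `str_interp()`. The helper name `build_review_url()` is hypothetical and is not part of the answer above:

```r
# Hypothetical helper: builds the review-page URL for a product and page.
# Uses base sprintf() rather than stringr::str_interp(); %s is replaced by
# the product ID and each %d by the page number.
build_review_url <- function(i, product) {
  sprintf(
    "https://www.amazon.it/product-reviews/%s/ref=cm_cr_getr_d_paging_btm_next_%d?ie=UTF8&reviewerType=all_reviews&pageNumber=%d",
    product, i, i
  )
}

# sprintf() is vectorised, so a whole range of pages can be passed at once
urls <- build_review_url(12:14, "B07WTHVQZH")
print(urls)
```

Printing the URLs before scraping is a cheap way to confirm that the page number ends up in both places in the address.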

You can use map_df from purrr to do the looping.

```
library(rvest)

page_numbers <- 12:82

purrr::map_df(page_numbers, ~{
  # Insert the current page number (.x) into both places in the URL
  url_reviews <- paste0("https://www.amazon.it/Philips-HD9260-90-Airfryer-plastica/product-reviews/B07WTHVQZH/ref=cm_cr_getr_d_paging_btm_next_", .x, "?ie=UTF8&reviewerType=all_reviews&pageNumber=", .x)
  doc <- read_html(url_reviews) # Assign results to `doc`

  # Review Title
  doc %>%
    html_nodes("[class='a-size-base a-link-normal review-title a-color-base review-title-content a-text-bold']") %>%
    html_text() -> review_title

  # Review Text
  doc %>%
    html_nodes("[class='a-size-base review-text review-text-content']") %>%
    html_text() -> review_text

  # Number of stars in review
  doc %>%
    html_nodes("[data-hook='review-star-rating']") %>%
    html_text() -> review_star

  # Return a data frame for this page
  data.frame(review_title,
             review_text,
             review_star,
             page = .x)
}) -> result

result
```
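To see what map_df() does here without scraping anything: it calls the function once per page number and stacks the returned data frames into one. A rough base R equivalent is sketched below. It assumes a hypothetical stand-in `fake_page()` that returns dummy rows instead of real reviews, and it uses `rbind()`, whereas map_df() binds rows via dplyr:

```r
# Hypothetical stand-in for one scraped page: two dummy reviews per page.
fake_page <- function(page) {
  data.frame(
    review_title = paste0("review ", 1:2, " on page ", page),
    page = page,
    stringsAsFactors = FALSE
  )
}

page_numbers <- 12:14  # shortened range for illustration

# Apply the function to every page number, then stack the results;
# this roughly mirrors what purrr::map_df() does in a single call.
per_page <- lapply(page_numbers, fake_page)
result <- do.call(rbind, per_page)

print(dim(result))
```

Because each page's rows carry their own `page` value, the combined data frame still records which page every review came from.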
