r - Fill empty values in web scraping with a string (rvest)



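I am trying to scrape user reviews from a website. Some reviews have no body text, so I end up with vectors of different lengths and get the error "arguments imply differing number of rows: 20, 19" (20 is correct) when I try to combine the scraped date-times, ratings, and reviews into a data frame.

I have looked at the solution here, which uses !nzchar to perform a replacement when the HTML node has zero length. That seems like a good solution to me, but I cannot get the code to insert values into the vector so that the length comes out right. The code I am using to scrape the node that contains empty values is: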

library(rvest)
library(tidyverse)
library(stringr)
url <- "http://www.trustpilot.com/review/www.amazon.com?page=2"
working_page <- read_html(url)
working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), NA) %>%
  str_trim() %>%
  unlist()
length(working_reviews)
[1] 19

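This returns a vector of 19 values; my expected output is a vector of 20 values, with "NA" filled in for the reviews that have no body text. On this page, the 17th review has no body.

Desired result: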

working_reviews[1]
[1] "I placed an order w/Amazon and selected the 18 payment plan. Amazon charged the entire amount to my card. Called them and got no where. I was told it was the banks fault and I had to take it up with them.Buyer be ware!!!"
working_reviews[17]
[17] "NA"

I have also tried using the lines below to "force" a string into the empty reviews:

working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), "No review") %>%
  str_trim() %>%
  unlist()

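This produced the same result of length 19, with no element containing "No review".

As a test, I also tried reversing the nzchar code by removing the "!", and got a 19-element vector with "NA" for every element.

Nice and tidy into a tibble, with NA returned if the review is missing: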
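Both results are consistent with the missing review having no body node at all, rather than an empty one. Here is a minimal base-R sketch of that situation. The `bodies` vector is a hypothetical stand-in for the scraped text, and the missing node is an assumption inferred from the 19-element result, not something confirmed from the page:

```r
# Hypothetical stand-in for the scraped text: three reviews on the page,
# but only two body nodes matched, so only two strings come back.
bodies <- c("First review text", "Third review text")

# No element is an empty string, so the logical index is all FALSE
# and replace() has nothing to change. The length stays the same.
filled <- replace(bodies, !nzchar(bodies), NA)
length(filled)   # 2
anyNA(filled)    # FALSE

# Dropping the "!" selects every non-empty string instead,
# which turns the whole vector into NA.
flipped <- replace(bodies, nzchar(bodies), NA)
all(is.na(flipped))   # TRUE
```

replace() can only overwrite positions that already exist, so it cannot restore an element that was never scraped.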

library(tidyverse)
library(rvest)
# Read the page once and reuse it for every column
page <-
  "https://www.trustpilot.com/review/www.amazon.com?page=2" %>%
  read_html()

tibble(
  name = page %>%
    html_elements(".styles_consumerName__dP8Um") %>%
    html_text2(),
  # The star rating is stored in the alt text of the header image
  rating = page %>%
    html_elements(".styles_reviewHeader__iU9Px img") %>%
    html_attr("alt") %>%
    parse_number(),
  title = page %>%
    html_elements(".link_notUnderlined__szqki.typography_color-inherit__TlgPO") %>%
    html_text2(),
  # Look up the body inside each review card, so a card without
  # a body yields NA instead of being dropped
  review = page %>%
    html_elements(".styles_reviewCard__hcAvl") %>%
    map(. %>%
      html_element(".typography_body__9UBeQ") %>%
      html_text2) %>%
    unlist()
)
# A tibble: 20 x 4
   name               rating title                               review
   <chr>               <dbl> <chr>                               <chr> 
 1 Octo Cavazos            1 I placed an order w/Amazon and sel~ "I pl~
 2 Jeffrey Hayes           1 Don't waste your time,energy or mo~ "Don'~
 3 Andy Here               1 Over the pandemic                   "Over~
 4 Lorna Mills             1 Customer service                    "I or~
 5 Daniel Sthamer          1 Prime delivery isn't worth it anym~ "Amaz~
 6 Carolyn                 2 Amzon delivery is not worth the pr~ "Amaz~
 7 BruceW                  5 “We apologize but Amazon has notic~ "“We ~
 8 Matthew Smego           1 Aweful                              "Almo~
 9 goku                    1 Prime membership traps…             "They~
10 Antoinette Barnett      2 Customer loyalty and/or history ar~ "Been~
11 AC                      1 Amazon has gone to sh**             "Amaz~
12 customer                1 so I ask for a refund back to my a~ "so I~
13 Will Chen               1 Rude and stupid customer service    "If p~
14 Matthew Blevins         1 Amazon Claims They Did Not Receive~ "I us~
15 Gem                     1 Ordered puppy food Monday received… "Orde~
16 SuzyJ                   1 On August 9 2022 it will have be t~ "On A~
17 Isabelle                1 Item arrived poorly packed and dam~  NA   
18 Hannah veibel           1 no Money returned                   "I or~
19 DiConti Jenine          1 Amazon is a fraudulent company.     "Amaz~
20 Urvashi                 1 Only Buyer oriented marketplace     "Does~
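The difference between selecting bodies page-wide and selecting them per review card can be reproduced on a small inline page. This is only a sketch: the HTML and the class names `card` and `body` are made up for illustration and are not Trustpilot's markup. It relies on rvest's documented behaviour that html_element() returns one result per input node.

```r
library(rvest)

# Made-up page with three review cards; the second one has no body.
page <- minimal_html('
  <div class="card"><h2>Title A</h2><p class="body">Great</p></div>
  <div class="card"><h2>Title B</h2></div>
  <div class="card"><h2>Title C</h2><p class="body">Bad</p></div>
')

# Page-wide selection only returns the nodes that exist,
# so the missing body shortens the vector.
page_wide <- page %>%
  html_elements(".body") %>%
  html_text2()
length(page_wide)   # 2

# Selecting the cards first and then one body per card keeps the
# length equal to the number of cards; a card without a body gives NA.
per_card <- page %>%
  html_elements(".card") %>%
  html_element(".body") %>%
  html_text2()
per_card            # "Great" NA "Bad"
```

Because `per_card` always has one entry per card, it lines up with the other columns when everything is combined into a tibble or data frame.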
