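I am trying to scrape user reviews from a website. Some reviews have no body text, so I end up with vectors of different lengths and get the error "arguments imply differing number of rows: 20, 19" (20 is the correct count) when I try to combine the scraped date/time, rating and review text into a data frame.
I have looked at the solution here, which uses !nzchar to perform a replacement when an html node has zero length. That looks like a good approach to me, but I cannot get the code to insert values into the vector so that its length comes out right. The code I am using to scrape the nodes that include empty values is: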
library(rvest)
library(tidyverse)
library(stringr)
url <- "http://www.trustpilot.com/review/www.amazon.com?page=2"
working_page <- read_html(url)
working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), NA) %>%
  str_trim() %>%
  unlist()
length(working_reviews)
[1] 19
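This returns a vector of 19 values; my expected output is a vector of 20 values, with "NA" filling in for the reviews that have no body. On this page, the 17th review has no body.
Desired result: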
working_reviews[1]
[1] "I placed an order w/Amazon and selected the 18 payment plan. Amazon charged the entire amount to my card. Called them and got no where. I was told it was the banks fault and I had to take it up with them.Buyer be ware!!!"
working_reviews[17]
[17] "NA"
I have also tried using the lines below to "force" a string into the empty reviews:
working_reviews <- working_page %>%
  html_nodes('.typography_body__9UBeQ.typography_color-black__5LYEn') %>%
  html_text(trim=TRUE) %>%
  replace(!nzchar(.), "No review") %>%
  str_trim() %>%
  unlist()
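This produced the same result of length 19, and none of the elements contained "No review".
As a test, I also tried inverting the nzchar condition by removing the "!", and got a 19-element vector with "NA" for every element.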
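One likely reading of both attempts is that the missing review never reaches the vector at all: the selector matches only the body nodes that exist, and replace() can overwrite elements but never add one. The following is a minimal base R sketch of that behaviour; matched_text is a hypothetical stand-in for the scraped text, not data from the page.

```r
# Hypothetical stand-in for the text of the body nodes that were actually matched
matched_text <- c("first review", "second review", "third review")

# No element is an empty string, so the mask is all FALSE and nothing is replaced
unchanged <- replace(matched_text, !nzchar(matched_text), NA)

# Without the "!", the mask is all TRUE, so every element becomes NA
all_na <- replace(matched_text, nzchar(matched_text), NA)

# In both cases the length is the same as the input
length(unchanged)
length(all_na)
```

If that reading is right, the fix has to happen at the selection step, so that each review contributes exactly one element whether or not it has a body.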
Tidied into a tibble, with NA returned where a review body is missing.
library(tidyverse)
library(rvest)
page <-
  "https://www.trustpilot.com/review/www.amazon.com?page=2" %>%
  read_html()
tibble(
  name = page %>%
    html_elements(".styles_consumerName__dP8Um") %>%
    html_text2(),
  rating = page %>%
    html_elements(".styles_reviewHeader__iU9Px img") %>%
    html_attr("alt") %>%
    parse_number(),
  title = page %>%
    html_elements(".link_notUnderlined__szqki.typography_color-inherit__TlgPO") %>%
    html_text2(),
  review = page %>%
    html_elements(".styles_reviewCard__hcAvl") %>%
    map(. %>%
      html_element(".typography_body__9UBeQ") %>%
      html_text2) %>%
    unlist()
)
# A tibble: 20 x 4
name rating title review
<chr> <dbl> <chr> <chr>
1 Octo Cavazos 1 I placed an order w/Amazon and sel~ "I pl~
2 Jeffrey Hayes 1 Don't waste your time,energy or mo~ "Don'~
3 Andy Here 1 Over the pandemic "Over~
4 Lorna Mills 1 Customer service "I or~
5 Daniel Sthamer 1 Prime delivery isn't worth it anym~ "Amaz~
6 Carolyn 2 Amzon delivery is not worth the pr~ "Amaz~
7 BruceW 5 “We apologize but Amazon has notic~ "“We ~
8 Matthew Smego 1 Aweful "Almo~
9 goku 1 Prime membership traps… "They~
10 Antoinette Barnett 2 Customer loyalty and/or history ar~ "Been~
11 AC 1 Amazon has gone to sh** "Amaz~
12 customer 1 so I ask for a refund back to my a~ "so I~
13 Will Chen 1 Rude and stupid customer service "If p~
14 Matthew Blevins 1 Amazon Claims They Did Not Receive~ "I us~
15 Gem 1 Ordered puppy food Monday received… "Orde~
16 SuzyJ 1 On August 9 2022 it will have be t~ "On A~
17 Isabelle 1 Item arrived poorly packed and dam~ NA
18 Hannah veibel 1 no Money returned "I or~
19 DiConti Jenine 1 Amazon is a fraudulent company. "Amaz~
20 Urvashi 1 Only Buyer oriented marketplace "Does~
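Because the class names on the live page may change, the same idea can be checked offline. This is a small sketch using rvest's minimal_html(); the markup and the class names .card and .body are invented for illustration and are not Trustpilot's. It contrasts selecting the body nodes directly with selecting one body per card.

```r
library(rvest)

# Hypothetical markup: three review cards, the second has no body paragraph
page <- minimal_html('
  <div class="card"><h2>Title A</h2><p class="body">Text A</p></div>
  <div class="card"><h2>Title B</h2></div>
  <div class="card"><h2>Title C</h2><p class="body">Text C</p></div>
')

# Selecting the body nodes directly skips the card that has none
direct <- page %>%
  html_elements(".body") %>%
  html_text2()

# Selecting the cards first, then at most one body per card,
# keeps one slot per card and yields NA where the body is absent
per_card <- page %>%
  html_elements(".card") %>%
  html_element(".body") %>%
  html_text2()

length(direct)    # two elements: only the bodies that exist
length(per_card)  # three elements: one per card
per_card
```

Here html_element() (singular) is applied to the whole set of cards at once rather than through map(); both routes should give one value per card, so the resulting column lines up with the other columns of the tibble.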