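Preparing multiple URLs for web scraping with rvest in R

I get inconsistent results when scraping multiple URLs with rvest. Concatenating the URLs with c() returns a character vector. Running html_nodes on it gives one of three different outcomes.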

library(rvest)
library(purrr)  # provides map()
url <- c("https://interestingengineering.com/due-to-the-space-inside-atoms-you-are-mostly- 
made-up-of-empty-space",
"https://futurism.com/mit-tech-self-driving-cars-see-under-surface-road",
"https://techxplore.com/news/2020-02-socially-robot-children-autism.html",
"https://eos.org/science-updates/hackathon-speeds-progress-toward-climate-model- 
collaboration",
"https://www.smithsonianmag.com/innovation/new-study-finds-people-prefer-robots- 
explain-themselves-180974299/",
"https://www.sciencedaily.com/releases/2020/02/200227144259.htm")
page <- map(url, ~ read_html(.x) %>% html_nodes("p") %>% html_text())

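This code either returns the content extracted from all of the URLs,

or it gives the following error message:

Error in open.connection(x, "rb") : Error while processing content unencoded: invalid code lengths set

or this error message:

Error during wrapup: HTTP error 410.

After that last error message I also get Browse[1]> in the console.

I also tried running the URLs from a CSV file: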

urldoc <- read.csv("URLs for rvest.csv", stringsAsFactors = FALSE, sep = ",")
page <- map(urldoc, ~ read_html(.x) %>% html_nodes("p") %>% html_text())

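The output of print(urldoc) looks similar to the output of the concatenated version, but I get a different error message:

Error in doc_parse_file(con, encoding = encoding, as_html = as_html, options = options) : Expecting a single string value: [type=character; extent=83].

I can't run html_nodes or html_text on a data frame.

1) How do I get consistent results without errors?
2) Better still, how can I use a file containing the URLs instead of a concatenated string?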

Your first problem seems to be caused by the line breaks in the URLs.
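As a minimal sketch of that fix, the wrapped URLs can be repaired by stripping all whitespace before scraping. The helper names clean_urls and scrape_paragraphs are hypothetical and not part of rvest. The second helper assumes you would rather skip a failing page than stop the whole run, which may help if a page really has been removed (HTTP 410 means "Gone"); it is one possible approach, not the only one.

```r
# Strip every whitespace character (spaces, tabs, newlines) from each URL.
# A valid URL contains no raw whitespace, so nothing meaningful is lost.
clean_urls <- function(urls) {
  gsub("[[:space:]]+", "", urls)
}

# Hypothetical helper: return the <p> text of one page, or NULL if reading
# or parsing the page fails for any reason (network error, HTTP error, ...).
scrape_paragraphs <- function(u) {
  tryCatch(
    rvest::html_text(rvest::html_nodes(rvest::read_html(u), "p")),
    error = function(e) {
      message("Skipping ", u, ": ", conditionMessage(e))
      NULL
    }
  )
}

# Two of the URLs from the question, one of them wrapped across lines.
url <- c("https://eos.org/science-updates/hackathon-speeds-progress-toward-climate-model- 
collaboration",
         "https://futurism.com/mit-tech-self-driving-cars-see-under-surface-road")

url <- clean_urls(url)
print(url)

# Requires the rvest package and network access, so it is left commented out:
# page <- lapply(url, scrape_paragraphs)
```

With this shape, page would hold one element per URL, and failed pages would show up as NULL instead of aborting the loop.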

As for your second question: I was able to reproduce your problem from the .csv. Here is the solution I found.

urldoc <- read.csv("URLs for rvest.csv", stringsAsFactors = FALSE, sep = ",", header = FALSE)
page <- map(urldoc[, 1], ~ read_html(.x) %>% html_nodes("p") %>% html_text())

Make sure the .csv has only one URL per line, and specify the column to read.
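The reason the column has to be selected can be seen without any network access. A data frame is a list of columns, so iterating over it hands the whole column to read_html at once, which is consistent with the "Expecting a single string value" error (the extent=83 would then be the number of rows, though that is an inference and not confirmed in the question). The sketch below uses a temporary two-line file as a stand-in for "URLs for rvest.csv"; the file contents are an assumption for illustration.

```r
# Write a small header-less CSV with one URL per line.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("https://futurism.com/mit-tech-self-driving-cars-see-under-surface-road",
    "https://www.sciencedaily.com/releases/2020/02/200227144259.htm"),
  csv_path
)

# header = FALSE keeps the first URL as data; with the default header = TRUE
# read.csv would use it as the column name instead.
urldoc <- read.csv(csv_path, stringsAsFactors = FALSE, header = FALSE)

# Iterating over the data frame visits columns, not rows: one element here,
# and that element is the full vector of URLs.
n_iterated_over_df <- length(urldoc)

# Selecting the first column gives a plain character vector,
# one element per URL, which is what map() or lapply() should receive.
urls <- urldoc[, 1]
n_iterated_over_column <- length(urls)

print(n_iterated_over_df)
print(n_iterated_over_column)
print(is.character(urls))
```

If the file did have a header row, dropping header = FALSE and selecting the column by name would be the equivalent choice.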
