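I have a problem similar to the one in "Scraping web pages, following links on the page, and forming a table with R". I would have posted this as a comment on that thread, but I don't have enough reputation yet.
I have the following code: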
library(rvest)
## Import the web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
## Extract the URLs I am interested in (selector found with 'selectorgadget')
FAO_Countries_urls <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_attr("href")
## Extract the link text (country names) with the same selector
FAO_Countries_links <- FAO_Countries %>%
  html_nodes(".linkcountry") %>%
  html_text()
## Create a data frame from the two previous objects
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links,
                                 FAO_Countries_urls = FAO_Countries_urls,
                                 stringsAsFactors = FALSE)
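At this point, I would like to fetch text from the pages behind the URLs in the right-hand column and add it as a new column on the right, and then do the same for the other content I need. However, when I run
FAO_Countries_data_text <- FAO_Countries_data$FAO_Countries_urls %>%
  html_nodes("#foodSecurity-1") %>%
  html_text()
I get the following error message: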
Error in UseMethod("xml_find_all") :
no applicable method for 'xml_find_all' applied to an object of class "character"
In other words, I cannot retrieve the pages behind the links stored in the newly created data frame.
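The error message suggests that `html_nodes()` is being handed a character vector of URLs rather than a parsed document such as the one `read_html()` returns. A minimal sketch of the difference follows; it uses an inline HTML string as a stand-in for a country page, so the markup and the content of `#foodSecurity-1` are assumptions made up for illustration, not the real FAO page.

```r
library(rvest)

# Hypothetical stand-in for one country page; the real markup may differ.
page_source <- '<html><body><div id="foodSecurity-1">Family farming</div></body></html>'

# Passing a plain character string to html_nodes() fails, as in the question.
# The error message is captured instead of stopping the script.
attempt <- tryCatch(
  html_nodes("/countryprofiles/index/en/?iso3=AFG", "#foodSecurity-1"),
  error = function(e) conditionMessage(e)
)

# Parsing first gives html_nodes() a document it can search.
parsed <- read_html(page_source)
food_security <- parsed %>%
  html_nodes("#foodSecurity-1") %>%
  html_text()
```

With the real site, each relative link would first need the `http://www.fao.org` prefix before being passed to `read_html()`. If the section is filled in by JavaScript after the page loads, the selector may still match nothing in the static HTML.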
Right now I have a data frame like this:
> head(FAO_Countries_data, n=3)
  FAO_Countries_links                  FAO_Countries_urls
1         Afghanistan /countryprofiles/index/en/?iso3=AFG
2             Albania /countryprofiles/index/en/?iso3=ALB
3             Algeria /countryprofiles/index/en/?iso3=DZA
I would like to extend this data frame by adding columns that hold the information found at each URL. For example:
  FAO_Countries_links                  FAO_Countries_urls  Food_security
1         Afghanistan /countryprofiles/index/en/?iso3=AFG Family farming
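One way such a column could be built is to read every page in turn and keep the text of the first matching node. The sketch below is only an illustration under stated assumptions: `get_section_text` is a hypothetical helper, and the two inline HTML strings are made-up stand-ins for country pages so that the example runs without network access.

```r
library(rvest)

# Hypothetical helper: return the text of the first node matching `css`,
# or NA when the page has no such node.
get_section_text <- function(page_source, css) {
  nodes <- html_nodes(read_html(page_source), css)
  if (length(nodes) == 0) return(NA_character_)
  trimws(html_text(nodes[[1]]))
}

# Made-up inline pages standing in for the country pages. With the real site,
# `sources` would be
# paste0("http://www.fao.org", FAO_Countries_data$FAO_Countries_urls).
sources <- c(
  '<html><body><div id="foodSecurity-1"> Family farming </div></body></html>',
  '<html><body><p>No matching section here</p></body></html>'
)

countries <- data.frame(
  FAO_Countries_links = c("Afghanistan", "Albania"),
  FAO_Countries_urls = c("/countryprofiles/index/en/?iso3=AFG",
                         "/countryprofiles/index/en/?iso3=ALB"),
  stringsAsFactors = FALSE
)

# One value per page, so the result lines up with the rows of the data frame.
countries$Food_security <- vapply(
  sources, get_section_text, character(1),
  css = "#foodSecurity-1", USE.NAMES = FALSE
)
```

Returning `NA` for pages without the section keeps the column the same length as the data frame, which matters if some country pages lack it or load it only through JavaScript.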
With the following code, I can extract the text of the "newsItems", "gsa-publications" and "projectsCountry" elements for 5 countries:
library(stringr)
library(rvest)
library(RDCOMClient)  # Windows only: drives Internet Explorer through COM
## Import the web page and build the data frame of country names and URLs
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
FAO_Countries_urls <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_attr("href")
FAO_Countries_links <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_text()
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links,
                                 FAO_Countries_urls = FAO_Countries_urls,
                                 stringsAsFactors = FALSE)
## Build absolute URLs from the relative links
url <- paste0("http://www.fao.org", FAO_Countries_data$FAO_Countries_urls)
## Start Internet Explorer and make the window visible
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
## One list per page element to extract
list_News_Text <- list()
list_GSA_Publication <- list()
list_ProjectsCountry <- list()
for (i in 1:5) {
  print(i)
  IEApp$Navigate(url[i])
  Sys.sleep(10)  # wait for the page to finish loading
  doc <- IEApp$Document()
  html_Content <- doc$documentElement()$innerText()  # full page text (not used below)
  web_Obj <- doc$getElementByID("newsItems")
  list_News_Text[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("gsa-publications")
  list_GSA_Publication[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("projectsCountry")
  list_ProjectsCountry[[i]] <- web_Obj$innerText()
}
print(list_News_Text)
You can use a similar approach to extract other elements from the different web pages.