R web scraping - links and URLs



I have a problem similar to scraping a web page with links on it and building a table with R. I would have posted this as a comment on that topic, but I do not have enough reputation yet.

I have the following code:

library(rvest)

## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
## Import the urls I am interested in with 'selectorgadget'
FAO_Countries_urls <- FAO_Countries %>% 
  html_nodes(".linkcountry") %>% 
  html_attr("href")
## Import the links I am interested in with 'selectorgadget'
FAO_Countries_links <- FAO_Countries %>%
  html_nodes(".linkcountry") %>% 
  html_text()
## Create a dataframe from the two previous objects
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links, 
                                 FAO_Countries_urls = FAO_Countries_urls,
                                 stringsAsFactors = FALSE)

At this point, I would like to retrieve text from each of those URLs and add it as a new column, and do the same for any other content I need. However, when I run

FAO_Countries_data_text <- FAO_Countries_data$FAO_Countries_urls %>%
  html_nodes("#foodSecurity-1") %>%
  html_text()

I get the following error message:

Error in UseMethod("xml_find_all") : 
no applicable method for 'xml_find_all' applied to an object of class "character"

In other words, I cannot retrieve content starting from the newly created data frame: the URLs in it are plain character strings, not parsed documents.
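A minimal sketch of the fix, assuming the `#foodSecurity-1` selector from the question actually exists on the country pages: each relative URL must first be completed with the site root and parsed with `read_html()` before `html_nodes()` can be applied to it.

```r
library(rvest)

## The hrefs scraped above are relative, so prepend the site root
full_urls <- paste0("http://www.fao.org", FAO_Countries_data$FAO_Countries_urls)

## Parse each page, then extract the text of the (assumed) selector;
## pages where the node is missing or the request fails yield NA
FAO_Countries_data$Food_security <- sapply(full_urls, function(u) {
  tryCatch({
    txt <- read_html(u) %>% html_nodes("#foodSecurity-1") %>% html_text()
    if (length(txt) == 0) NA_character_ else paste(txt, collapse = " ")
  }, error = function(e) NA_character_)
})
```

Note that `read_html()` only sees the static HTML; if the content is rendered by JavaScript, it will come back empty, which is why the browser-driven approach later in this post is needed.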

For now, I have a data frame as follows:

> head(FAO_Countries_data, n=3)
  FAO_Countries_links                  FAO_Countries_urls
1         Afghanistan /countryprofiles/index/en/?iso3=AFG
2             Albania /countryprofiles/index/en/?iso3=ALB
3             Algeria /countryprofiles/index/en/?iso3=DZA

I would like to extend this data frame by adding columns containing information found on the various URLs. For example:

  FAO_Countries_links                  FAO_Countries_urls  Food_security
1         Afghanistan /countryprofiles/index/en/?iso3=AFG Family farming

Using the following code, I was able to extract the text of the "newsItems", "gsa-publications" and "projectsCountry" items for 5 countries:

library(stringr)
library(rvest)
library(RDCOMClient)
## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
FAO_Countries_urls <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_attr("href")
FAO_Countries_links <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_text()
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links, 
                                 FAO_Countries_urls = FAO_Countries_urls, stringsAsFactors = FALSE)
## The country pages load their content dynamically, so drive Internet
## Explorer through COM instead of parsing the static HTML
url <- paste0("http://www.fao.org", FAO_Countries_data$FAO_Countries_urls) 
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
list_News_Text <- list()
list_GSA_Publication <- list()
list_ProjectsCountry <- list()
for(i in 1 : 5)
{
  print(i)
  IEApp$Navigate(url[i])
  
  ## Give the page time to render its dynamic content
  Sys.sleep(10)
  
  doc <- IEApp$Document()
  html_Content <- doc$documentElement()$innerText()
  web_Obj <- doc$getElementByID("newsItems")
  list_News_Text[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("gsa-publications")
  list_GSA_Publication[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("projectsCountry")
  list_ProjectsCountry[[i]] <- web_Obj$innerText()
}
print(list_News_Text)
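To attach the scraped text to the data frame built earlier, the three result lists can be flattened into columns. This is a sketch assuming each of the 5 pages produced exactly one text string per item (otherwise `unlist()` will not line up with the rows):

```r
## Combine the three result lists into columns for the first 5 countries
scraped <- data.frame(
  FAO_Countries_data[1:5, ],
  News_Text       = unlist(list_News_Text),
  GSA_Publication = unlist(list_GSA_Publication),
  ProjectsCountry = unlist(list_ProjectsCountry),
  stringsAsFactors = FALSE
)
head(scraped, n = 3)
```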

You can use a similar approach to extract other items from different web pages.
