R web scraping - links and URLs



I have a problem similar to scraping a web page with links on it and building a table with R. I would have posted this as a comment on that topic, but I do not have enough reputation yet.

I have the following code:

library(rvest)

## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
## Import the urls I am interested in with 'selectorgadget'
FAO_Countries_urls <- FAO_Countries %>% 
  html_nodes(".linkcountry") %>% 
  html_attr("href")
## Import the links I am interested in with 'selectorgadget'
FAO_Countries_links <- FAO_Countries %>%
  html_nodes(".linkcountry") %>% 
  html_text()
## Create a dataframe from the two previous objects
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links, 
                                 FAO_Countries_urls = FAO_Countries_urls,
                                 stringsAsFactors = FALSE)

At this point, I would like to retrieve text from each of those URLs and add it as a new column, and do the same for any other content I need. However, when I run

FAO_Countries_data_text <- FAO_Countries_data$FAO_Countries_urls %>%
  html_nodes("#foodSecurity-1") %>%
  html_text()

I get the following error message:

Error in UseMethod("xml_find_all") : 
no applicable method for 'xml_find_all' applied to an object of class "character"

In other words, I cannot retrieve content starting from the newly created data frame: the URLs in it are plain character strings, not parsed documents.
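A minimal sketch of the fix, assuming the `#foodSecurity-1` selector from the question actually exists on the country pages: each relative URL must first be completed with the site root and parsed with `read_html()` before `html_nodes()` can be applied to it.

```r
library(rvest)

## The hrefs scraped above are relative, so prepend the site root
full_urls <- paste0("http://www.fao.org", FAO_Countries_data$FAO_Countries_urls)

## Parse each page, then extract the text of the (assumed) selector;
## pages where the node is missing or the request fails yield NA
FAO_Countries_data$Food_security <- sapply(full_urls, function(u) {
  tryCatch({
    txt <- read_html(u) %>% html_nodes("#foodSecurity-1") %>% html_text()
    if (length(txt) == 0) NA_character_ else paste(txt, collapse = " ")
  }, error = function(e) NA_character_)
})
```

Note that `read_html()` only sees the static HTML; if the content is rendered by JavaScript, it will come back empty, which is why the browser-driven approach later in this post is needed.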

For now, I have a data frame as follows:

> head(FAO_Countries_data, n=3)
  FAO_Countries_links                  FAO_Countries_urls
1         Afghanistan /countryprofiles/index/en/?iso3=AFG
2             Albania /countryprofiles/index/en/?iso3=ALB
3             Algeria /countryprofiles/index/en/?iso3=DZA

I would like to extend this data frame by adding columns containing information found on the various URLs. For example:

  FAO_Countries_links                  FAO_Countries_urls  Food_security
1         Afghanistan /countryprofiles/index/en/?iso3=AFG Family farming

Using the following code, I was able to extract the text of the "newsItems", "gsa-publications" and "projectsCountry" items for 5 countries:

library(stringr)
library(rvest)
library(RDCOMClient)
## Import web page
FAO_Countries <- read_html("http://www.fao.org/countryprofiles/en/")
FAO_Countries_urls <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_attr("href")
FAO_Countries_links <- FAO_Countries %>% html_nodes(".linkcountry") %>% html_text()
FAO_Countries_data <- data.frame(FAO_Countries_links = FAO_Countries_links, 
                                 FAO_Countries_urls = FAO_Countries_urls, stringsAsFactors = FALSE)
## The country pages load their content dynamically, so drive Internet
## Explorer through COM instead of parsing the static HTML
url <- paste0("http://www.fao.org", FAO_Countries_data$FAO_Countries_urls) 
IEApp <- COMCreate("InternetExplorer.Application")
IEApp[['Visible']] <- TRUE
list_News_Text <- list()
list_GSA_Publication <- list()
list_ProjectsCountry <- list()
for(i in 1 : 5)
{
  print(i)
  IEApp$Navigate(url[i])
  
  ## Give the page time to render its dynamic content
  Sys.sleep(10)
  
  doc <- IEApp$Document()
  html_Content <- doc$documentElement()$innerText()
  web_Obj <- doc$getElementByID("newsItems")
  list_News_Text[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("gsa-publications")
  list_GSA_Publication[[i]] <- web_Obj$innerText()
  web_Obj <- doc$getElementByID("projectsCountry")
  list_ProjectsCountry[[i]] <- web_Obj$innerText()
}
print(list_News_Text)
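To attach the scraped text to the data frame built earlier, the three result lists can be flattened into columns. This is a sketch assuming each of the 5 pages produced exactly one text string per item (otherwise `unlist()` will not line up with the rows):

```r
## Combine the three result lists into columns for the first 5 countries
scraped <- data.frame(
  FAO_Countries_data[1:5, ],
  News_Text       = unlist(list_News_Text),
  GSA_Publication = unlist(list_GSA_Publication),
  ProjectsCountry = unlist(list_ProjectsCountry),
  stringsAsFactors = FALSE
)
head(scraped, n = 3)
```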

You can use a similar approach to extract other items from different web pages.
