How to split merged/glued words that have no delimiter, using R



I am using rvest in R to scrape the keywords from this article page, with the following code:

#install.packages("xml2") # required for rvest
library("rvest") # for web scraping
library("dplyr") # for data management
# start by reading the web page to be scraped
page <- read_html("https://www.sciencedirect.com/science/article/pii/S1877042810004568")
keyW <- page %>% html_nodes("div.Keywords.u-font-serif") %>% html_text() %>% paste(collapse = ",")

It gives me:

> keyW    
[1] "KeywordsPhysics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

I then remove "Keywords" and everything before it with this line of code:

keyW <- gsub(".*Keywords","", keyW)

The new keyW is:

[1] "Physics curriculumTurkish education systemfinnish education systemPISAphysics achievement"

However, the output I want is this list:

[1] "Physics curriculum" "Turkish education system" "finnish education system" "PISA" "physics achievement"

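How should I approach this? I think it comes down to:

  1. How to scrape the keywords from the website correctly
  2. How to split the string correctly

Thanks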

If you extract the words from the span tags, you get the expected output directly.

library(rvest)
page %>%  html_nodes("div.Keywords span") %>% html_text()
#[1] "Physics curriculum"       "Turkish education system" "finnish education system"
#[4] "PISA"                     "physics achievement"    
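
To see why selecting the spans avoids the glued string, here is a minimal offline sketch. The HTML below is a hypothetical, simplified stand-in for the page's keyword markup (an assumption, not the site's exact structure), and the names `html`, `glued` and `keywords` are illustrative only. Calling html_text() on the container returns all of its descendant text as one string, while calling it on the spans returns one element per keyword.

```r
library(rvest)

# Hypothetical, simplified markup (an assumption): a heading followed by
# one <span> per keyword, with no whitespace between the tags.
html <- paste0(
  "<div class='Keywords u-font-serif'>",
  "<h2>Keywords</h2>",
  "<div class='keyword'><span>Physics curriculum</span></div>",
  "<div class='keyword'><span>Turkish education system</span></div>",
  "<div class='keyword'><span>PISA</span></div>",
  "</div>"
)
page <- read_html(html)

# Text of the whole container: every descendant text node is concatenated,
# so the heading and the keywords end up in a single string.
glued <- page %>% html_nodes("div.Keywords") %>% html_text()

# Text of each <span>: a character vector with one element per keyword.
keywords <- page %>% html_nodes("div.Keywords span") %>% html_text()
print(keywords)
```

Because the heading text is not inside a span, this selector also makes the gsub() step unnecessary.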
