r - Extract all text & tags between two heading tags (<h3>) with rvest

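This page has six sections that list people between <h3> tags.

How can I use XPath to select these six sections separately (using rvest), or put them into a nested list? My goal is to later lapply over the six sections to get the people's names and affiliations (separated by section).

The HTML is not well structured, i.e. not every piece of text sits inside its own tag. An example: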

<h3>Editor-in-Chief</h3>
Claudio Ronco &ndash; <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark &ndash; <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi &ndash; <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />

I access the site with the following code:

journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- rvest::html_session(journal_url,
                  httr::user_agent("Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20"))
webpage <- rvest::html_nodes(webpage, css = '#editorialboard')

I tried various XPath expressions with html_nodes to extract the six sections into a nested list of six lists, but none of them works properly:

# this gives me a list of 190 (instead of 6) elements, leaving out the text between <i> and </i>
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 190 (instead of 6) elements, leaving out text that are not between tags
webpage <- rvest::html_nodes(webpage, xpath = '//*[preceding-sibling::h3 and following-sibling::h3]')
# error "VECTOR_ELT() can only be applied to a 'list', not a 'logical'"
webpage <- rvest::html_nodes(webpage, xpath = '//* and text()[preceding-sibling::h3 and following-sibling::h3]')
# this gives me a list of 274 (instead of 6) elements
webpage <- rvest::html_nodes(webpage, xpath = '//text()[preceding-sibling::h3]')

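Would you accept an ugly solution that does not use XPath? I don't think you can get a nested list out of this site's structure... but I am not very experienced with XPath.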
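That said, for sibling-based markup like the example in the question, a per-section XPath selection does seem workable. The following is only a minimal sketch run against an inline copy of that example, not against the live page: the wrapping `<div>` and the helper name `extract_sections` are assumptions made for illustration. For the i-th `<h3>`, it selects the following siblings that have exactly i `<h3>` elements before them, which is everything up to the next heading.

```r
library(rvest)
library(xml2)

# Inline copy of the markup from the question; the wrapping <div> is an
# assumption about how the real page nests it.
html <- '<div id="editorialboard">
<h3>Editor-in-Chief</h3>
Claudio Ronco &ndash; <i>St. Bartolo Hospital</i>, Vicenza, Italy<br />
<br />
<h3>Clinical Engineering</h3>
William R. Clark &ndash; <i>Purdue University</i>, West Lafayette, IN, USA<br />
Hideyuki Kawanashi &ndash; <i>Tsuchiya General Hospital</i>, Hiroshima, Japan<br />
</div>'

board <- html_node(read_html(html), css = "#editorialboard")

# Hypothetical helper: returns one character vector of members per <h3> section.
extract_sections <- function(board) {
  headings <- html_text(html_nodes(board, "h3"))
  sections <- lapply(seq_along(headings), function(i) {
    # Siblings after the i-th <h3> that have exactly i <h3> elements before
    # them, i.e. everything up to (but excluding) the next <h3>.
    xp <- sprintf(
      "./h3[%d]/following-sibling::node()[not(self::h3)][count(preceding-sibling::h3) = %d]",
      i, i
    )
    nodes <- xml_find_all(board, xp)
    # Treat <br> as a line break; keep the text of everything else (incl. <i>).
    pieces <- ifelse(xml_name(nodes) == "br", "\n", xml_text(nodes))
    lines <- trimws(strsplit(paste(pieces, collapse = ""), "\n", fixed = TRUE)[[1]])
    lines[lines != ""]
  })
  names(sections) <- headings
  sections
}

sections <- extract_sections(board)
```

`sections` is then a named list with one character vector per heading. Whether the same expression holds up on the real page depends on the `<h3>` elements and the names being direct children of the same parent, which I have not verified.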

I first get the headings, split the raw text on the heading names, and then, within each group, split the members using '\n' as the separator.

library(rvest)

journal_url <- "https://www.karger.com/Journal/EditorialBoard/223997"
webpage <- read_html(journal_url) %>% html_node(css = '#editorialboard')
# get h3 headings
headings <- webpage %>% html_nodes('h3') %>% html_text()
# get raw text
raw.text <- webpage %>% html_text()
# split raw text on h3 headings and put in a list
list.members <- list()
raw.text.2 <- raw.text
for (h in headings) {
  # split on the heading: piece 1 is the previous section, piece 2 the rest
  b <- strsplit(raw.text.2, h, fixed = TRUE)
  # split members using \n as separator
  members <- strsplit(b[[1]][1], '\n', fixed = TRUE)
  # clean empty elements from vector
  members <- list(members[[1]][members[[1]] != ""])
  # add vector of members to list
  list.members <- c(list.members, members)
  # update text
  raw.text.2 <- b[[1]][2]
}
# remove first element of main list (the text before the first heading)
list.members <- list.members[-1]
# add final segment of raw.text to list
members <- strsplit(raw.text.2, '\n', fixed = TRUE)
members <- list(members[[1]][members[[1]] != ""])
list.members <- c(list.members, members)
# add names to list
names(list.members) <- headings

You then get a list of groups; each element of the list is a character vector with one string per member (containing all the info):

> list.members$`Editor-in-Chief`
[1] "Claudio Ronco – St. Bartolo Hospital, Vicenza, Italy"
> list.members$`Clinical Engineering`
 [1] "William R. Clark – Purdue University, West Lafayette, IN, USA"                     
 [2] "Hideyuki Kawanashi – Tsuchiya General Hospital, Hiroshima, Japan"                  
 [3] "Tadayuki Kawasaki – Mobara Clinic, Mobara City, Japan"                             
 [4] "Jeongchul Kim – Wake Forest School of Medicine, Winston-Salem, NC, USA"            
 [5] "Anna Lorenzin – International Renal Research Institute of Vicenza, Vicenza, Italy" 
 [6] "Ikuto Masakane – Honcho Yabuki Clinic, Yamagata City, Japan"                       
 [7] "Michio Mineshima – Tokyo Women's Medical University, Tokyo, Japan"                 
 [8] "Tomotaka Naramura – Kurashiki University of Science and the Arts, Kurashiki, Japan"
 [9] "Mauro Neri – International Renal Research Institute of Vicenza, Vicenza, Italy"    
[10] "Masanori Shibata – Koujukai Rehabilitation Hospital, Nagoya City, Japan"           
[11] "Yoshihiro Tange – Kyushu University of Health and Welfare, Nobeoka-City, Japan"    
[12] "Yoshiaki Takemoto – Osaka City University, Osaka City, Japan"
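
Since the stated goal is to lapply over the sections and pull out names and affiliations, one possible follow-up is sketched below. It assumes every entry has the form "name – affiliation" with an en dash surrounded by spaces, as in the output above; `split_member` is a hypothetical helper, and the small `list.members` here is only a stand-in so that the snippet runs on its own.

```r
# Stand-in for the `list.members` result above (first entries only).
list.members <- list(
  `Editor-in-Chief` = "Claudio Ronco \u2013 St. Bartolo Hospital, Vicenza, Italy",
  `Clinical Engineering` = c(
    "William R. Clark \u2013 Purdue University, West Lafayette, IN, USA",
    "Hideyuki Kawanashi \u2013 Tsuchiya General Hospital, Hiroshima, Japan"
  )
)

# Hypothetical helper: split each "name – affiliation" string at the first
# " – " only, so that any later dash stays inside the affiliation.
split_member <- function(x) {
  parts <- regmatches(x, regexpr(" \u2013 ", x, fixed = TRUE), invert = TRUE)
  data.frame(
    name = vapply(parts, `[`, character(1), 1),
    # entries without a separator get NA as affiliation
    affiliation = vapply(parts, `[`, character(1), 2),
    stringsAsFactors = FALSE
  )
}

# One data frame per section, keyed by heading.
people <- lapply(list.members, split_member)
```

`people$"Clinical Engineering"` is then a data frame with a `name` and an `affiliation` column, one row per member.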
