Scraping text with R when there is no CSS path

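Hi, I'm writing because I've been racking my brain trying to find a way to scrape data from a web page ("https://nabtu.org/about-nabtu/official-directory/building-trades-local-councils-overview/"). I'm doing this for practice, just to learn how to scrape data. I'm trying to scrape the contact data (Office, Fax, Email) from that page, but I can't, because SelectorGadget won't give me a CSS path for it. I'm using R, and my script looks something like this.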

library(rvest)
page_name <- read_html("page html")

page_name %>%
  html_node("selector gadget node") %>%
  html_text()

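I've scraped all the other data; I just can't get this contact information. Any help would be greatly appreciated, because my head is about to explode. Thanks in advance.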

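I don't see where the problem is. Each contact block has the class .council-list. Using that selector, you can extract the contact information for each block separately. After that, use some string/regex manipulation to pull out the exact fields.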

library(rvest)    # also provides the %>% pipe
library(stringr)

page_name <- read_html('https://nabtu.org/about-nabtu/official-directory/building-trades-local-councils-overview/')
# Text of every contact block on the page
contact_strings <- page_name %>%
  html_nodes('.council-list') %>%
  html_text()
# Filter out strings that don't contain contact information
contact_strings <- grep('Email|Fax|Office', contact_strings, ignore.case = TRUE, value = TRUE)
# Extract the individual fields (NA where a block lacks a field)
office <- str_extract(contact_strings, 'Office:[^[:alpha:]]*')
fax <- str_extract(contact_strings, 'Fax:[^[:alpha:]]*')
email <- str_extract(contact_strings, 'Email: [^ ]*')
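
As a rough sketch of what could come next, the extracted strings can be stripped of their labels and collected into a data frame. The sample strings and the extract_field helper below are made up for illustration, so the snippet runs without touching the site; the real page text may be laid out differently, and the patterns may need adjusting.

```r
library(stringr)

# Hypothetical contact strings standing in for the html_text() output
sample_strings <- c(
  "Example Council Office: (555) 010-0000 Fax: (555) 010-0001 Email: info@example.org",
  "Another Council Office: (555) 010-0002 Email: contact@example.org"
)

# Hypothetical helper: extract a field, then drop its label and stray whitespace
extract_field <- function(x, pattern) {
  str_trim(str_remove(str_extract(x, pattern), "^[A-Za-z]+:"))
}

# One row per contact block; missing fields come back as NA
contacts <- data.frame(
  office = extract_field(sample_strings, "Office:[^[:alpha:]]*"),
  fax    = extract_field(sample_strings, "Fax:[^[:alpha:]]*"),
  email  = extract_field(sample_strings, "Email: [^ ]*"),
  stringsAsFactors = FALSE
)
print(contacts)
```

Because str_extract is vectorised and returns NA when nothing matches, the three columns stay aligned with the contact blocks even when a block has no fax number or email.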
