Web scraping dates from a table by date and string into R



I need to web scrape, from http://www.bls.gov/schedule/news_release/2015_sched.htm, every date whose "Release" column contains "Employment Situation". The scraped output should look like this:

Friday, January 09, 2015
Friday, February 06, 2015
Friday, March 06, 2015
Friday, April 03, 2015
Friday, May 08, 2015
Friday, June 05, 2015
Thursday, July 02, 2015
Friday, August 07, 2015
Friday, September 04, 2015
Friday, October 02, 2015
Friday, November 06, 2015
Friday, December 04, 2015

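To do this, I would like to repeat the following 12 times, once for each month. Note that http://www.bls.gov/schedule/news_release/2015_sched.htm contains 12 tables, one per month, available as tbl[[2]], tbl[[3]], and so on.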

library(rvest)
url <- 'http://www.bls.gov/schedule/news_release/2015_sched.htm'
ses <- html_session(url)
tbl <- html_table(ses, fill = T) 
nfpdates <- tbl[[2]]$`Date`
nfpdates <- gsub('\.', '', nfpdates)
nfpdates <- as.Date(nfpdates, 'weekdaystr(iD,:), %b %d, %Y')

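It does not work. The first problem is simple: I don't know how to refer to the day of the week in the format string; 'weekdaystr(iD,:)' is wrong. The second is more complicated: how do I extract only the dates that have "Employment Situation" under "Release"?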

Any help would be appreciated. Thank you very much.

This is a perfect use case for XPath:

library(rvest)
pg <- read_html("http://www.bls.gov/schedule/news_release/2015_sched.htm")
# we need to target only the <td> elements under the bodytext div
body <- html_nodes(pg, "div#bodytext")
# we use this new set of nodes and a relative XPath to get the initial <td> elements, then get their siblings
es_nodes <- html_nodes(body, xpath=".//td[contains(., 'Employment Situation for')]/../td[1]")
# clean up the cruft and make our dates!
as.Date(trimws(html_text(es_nodes)), format="%A, %B %d, %Y")
##  [1] "2015-01-09" "2015-02-06" "2015-03-06" "2015-03-18" "2015-04-03"
##  [6] "2015-05-08" "2015-06-05" "2015-07-02" "2015-08-07" "2015-09-04"
## [11] "2015-10-02" "2015-11-06" "2015-12-04"
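
To see what that relative XPath selects without hitting the network, here is a minimal sketch run against a small inline HTML fragment. The fragment, its three-column layout (date, time, release), and the row contents are assumptions made up for illustration; only the selector and the rvest calls come from the answer above.

```r
library(rvest)

# Hypothetical stand-in for the schedule page: one table inside div#bodytext
html <- '<div id="bodytext"><table>
<tr><td>Friday, January 09, 2015</td><td>08:30 AM</td><td>Employment Situation for December 2014</td></tr>
<tr><td>Friday, January 16, 2015</td><td>08:30 AM</td><td>Consumer Price Index for December 2014</td></tr>
</table></div>'

pg <- read_html(html)
body <- html_nodes(pg, "div#bodytext")

# Find the <td> whose text matches, step up to its parent <tr> with "..",
# then take that row's first <td>, which holds the date
nodes <- html_nodes(body, xpath = ".//td[contains(., 'Employment Situation for')]/../td[1]")
matched <- html_text(nodes)
print(matched)
```

Only the first row's date is selected, because only that row has a cell containing the search text.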

As for the first question, it can be solved by using the following format:

nfpdates <- as.Date(nfpdates,"%A, %B %d, %Y")

Now, using the weekdays() function, you can find the day of the week.
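
As a small offline check of that format string, the sketch below parses two of the dates from the question and recovers the weekday. The sample vector `x` is taken from the expected output in the question; switching `LC_TIME` to "C" is an assumption added here so that English day and month names parse regardless of the machine's locale.

```r
# Assumption: force an English-compatible time locale so %A and %B match English names
invisible(Sys.setlocale("LC_TIME", "C"))

x <- c("Friday, January 09, 2015", "Thursday, July 02, 2015")

# %A = full weekday name, %B = full month name, %d = day of month, %Y = four-digit year
d <- as.Date(x, format = "%A, %B %d, %Y")
print(d)

# weekdays() derives the day name back from the parsed Date
wd <- weekdays(d)
print(wd)
```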

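Now, on to the second issue. Assuming you want the dates for which "Employment Situation" appears under the "Release" column, it can be done as follows: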

test <- tbl[[2]]$Date
test[grepl('Employment Situation',tbl[[2]]$Release)]
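
The same filter can be applied to every monthly table instead of repeating it by hand twelve times. The following is a minimal sketch under stated assumptions: the `tbl` list here is a hypothetical two-month stand-in for what `html_table()` returns, with made-up rows, and it assumes (as the question says) that the monthly tables start at element 2 and each has `Date` and `Release` columns.

```r
# Hypothetical stand-in for the scraped list; element 1 is not a monthly table
tbl <- list(
  data.frame(x = "page layout", stringsAsFactors = FALSE),
  data.frame(
    Date = c("Friday, January 09, 2015", "Friday, January 16, 2015"),
    Release = c("Employment Situation for December 2014",
                "Consumer Price Index for December 2014"),
    stringsAsFactors = FALSE
  ),
  data.frame(
    Date = c("Friday, February 06, 2015", "Thursday, February 26, 2015"),
    Release = c("Employment Situation for January 2015",
                "Consumer Price Index for January 2015"),
    stringsAsFactors = FALSE
  )
)

# Drop the first element, then keep the Date of each row whose Release matches
monthly <- tbl[-1]
es_dates <- unlist(lapply(monthly, function(m) {
  m$Date[grepl("Employment Situation", m$Release, fixed = TRUE)]
}))
print(es_dates)
```

With the real page, `tbl[-1]` would be replaced by whichever range of elements actually holds the monthly tables, for example `tbl[2:13]` if the question's description is accurate.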
