I came across a public data set that I don't know how to get into R directly. Usually, I use the following R code to pull data from the web:
temp <- tempfile()
download.file("http://www.webaddress.com",temp)
data <- read.csv(unz(temp, "name_of_file"))
unlink(temp)
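This SEC site, however, has me a bit stumped about how to get the data straight into R. One reason is that when you right-click the link, instead of a URL you get the following code: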
javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$cphMain$lnkSECReport", "", false, "", "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", false, true))
URL: http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx
Is there a way to get this data directly into R? So far, I download the file, open it with 7-Zip, save it to Excel, and then import it into R.
Updated code:
library(httr)
library(xml2)
library(magrittr)  # for %>%

# Replay the postback with only the __EVENTTARGET form field
res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC",
            httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
            body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"),
            encode = "form")

# The response body is a gzip file, so save the raw bytes
writeBin(content(res, as = "raw"), "report.gz")

gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)

# Legal names of the first 10 firms
xml_find_all(doc, ".//Firms/Firm/Info") %>%
  xml_attr("LegalNm") %>%
  head(10)
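It's a really, horribly bad SharePoint site, the kind that is popping up like mad in virtually all government e-initiatives worldwide and making data increasingly less transparent.
That said, I'm rather surprised that this worked: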
library(httr)
library(xml2)
res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC",
httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"),
encode = "form")
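After cancelling the direct download and looking at the web call in Developer Tools (which must be open before the download starts), I used curlconverter to extract the web call.
The original computed httr request function looked like this: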
httr::VERB(verb = "POST", url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC",
httr::add_headers(Origin = "http://www.adviserinfo.sec.gov",
`Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8",
`Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36",
Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
`Cache-Control` = "max-age=0", Referer = "http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx",
Connection = "keep-alive", DNT = "1"), httr::set_cookies(ASP.NET_SessionId = "vp5bt2nrl5m3l4tqq4mkbfrz"),
body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport",
`__EVENTARGUMENT` = "", `__VIEWSTATE` = "/wEPDwUIOTg2OTY2NjYPZBYCZg9kFgQCAQ8WAh4EVGV4dAUeSUFQRCAtIEludmVzdG1lbnQgQWR2aXNlciBEYXRhZAIDD2QWAgIFD2QWEAIDDw8WAh4LUG9zdEJhY2tVcmwFUn4vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMWAh4Hb25jbGljawWvAWdhKCdzZW5kJywgJ3BhZ2V2aWV3JywgeydwYWdlJzogJ34vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMnLCAndGl0bGUnOiAnSUFQRCAtIFNFQyBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J0IChHWklQKSd9KTtkAgcPZBYCZg8PFgIfAAVKUmVwb3J0IGFzIG9mOiA8Yj5TZXB0ZW1iZXIgNiwgMjAxNjwvYj4gPGJyLz5BcHByb3hpbWF0ZSBmaWxlIHNpemU6IDM3IE1CICBkZAINDw8WAh8BBVR+L0lBUEQvQ29udGVudC9CdWxrRmVlZC9Db21waWxhdGlvbkRvd25sb2FkLmFzcHg/RmVlZFBLPTM3MjY1JkZlZWRUeXBlPUlBX0ZJUk1fU1RBVEUWAh8CBbMBZ2EoJ3NlbmQnLCAncGFnZXZpZXcnLCB7J3BhZ2UnOiAnfi9JQVBEL0NvbnRlbnQvQnVsa0ZlZWQvQ29tcGlsYXRpb25Eb3dubG9hZC5hc3B4P0ZlZWRQSz0zNzI2NSZGZWVkVHlwZT1JQV9GSVJNX1NUQVRFJywgJ3RpdGxlJzogJ0lBUEQgLSBTdGF0ZSBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J... <truncated>
`__VIEWSTATEGENERATOR` = "C7F140E8", `__PREVIOUSPAGE` = "_n_AIWFFdFo0uFQroVexEbLyjk41mQczgUv0yM_5WfsMAs5Mr4_W9OsfhauW1md49E6AtLMLKvwsM3efjdsFxSQVs8m60rXjM2G3a38s-vs9jeifY7Z97KwNciQDnS3E0",
`__EVENTVALIDATION` = "/wEdAAQgBK7oCoSH1SyM/nnv4+7OQ6BBh5UglL0V4PbvTmfHL5ETgQBTBoVSpnQmZd0nxKz/1ubqHHzGDP0ztOLUKJjXWi90IlgKV4uaEBSHcRvGBiO1/K20oSh88Xa2qq9BBCI="),
encode = "form")
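In my experience, these truly evil SharePoint sites need all sorts of "view state" information, but I tried trimming and converting the call, and it works (at least within 2 minutes of my initial visit to the site).
You're not out of the woods yet, because: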
res$headers$`content-type`
## "application/x-gzip; charset=utf-8"
even if you add:
`Accept-Encoding` = "gzip, deflate"
to the add_headers() call.
So, since memDecompress() is an absolutely useless function, you need to run:
writeBin(content(res, as="raw"), "report.gz")
to write the gzipped content to a file.
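Since the trimmed request was only checked shortly after visiting the site, it may be worth confirming that the body really is gzip before saving it. The following is a minimal sketch, not part of the original answer: is_gzip() is a hypothetical helper that tests for the gzip magic bytes (0x1f 0x8b), and the sample bytes are built locally so the check runs offline.

```r
# is_gzip() is a hypothetical helper: TRUE when a raw vector starts with the
# gzip magic bytes 0x1f 0x8b
is_gzip <- function(x) {
  is.raw(x) && length(x) >= 2 && identical(x[1:2], as.raw(c(0x1f, 0x8b)))
}

# Offline demonstration with locally built bytes instead of content(res, as = "raw")
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, "wb")
writeBin(charToRaw("<Firms/>"), con)
close(con)
gz_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))

# Made-up stand-in for an HTML page returned in place of the file
html_bytes <- charToRaw("<html><body>Session expired</body></html>")

is_gzip(gz_bytes)
is_gzip(html_bytes)
```

With the real response, the same check would be is_gzip(content(res, as = "raw")) just before the writeBin() call.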
Now we can work with it directly:
library(magrittr)  # for %>%
gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)
xml_find_all(doc, ".//Firms/Firm/Info") %>%
  xml_attr("LegalNm") %>%
  head(10)
## [1] "LAUNCH ANGELS MANAGEMENT COMPANY, LLC" "JACOBSEN CAPITAL MANAGEMENT, LLC"
## [3] "CORESTATES CAPITAL ADVISORS, LLC" "MINNEAPOLIS PORTFOLIO MANAGEMENT GROUP, LLC"
## [5] "SHANNON RIVER FUND MANAGEMENT, LLC" "AAC BENELUX HOLDING BV"
## [7] "WILLINK ASSET MANAGEMENT LLC" "SPIVAK ASSET MANAGEMENT, LLC"
## [9] "ANNALY MANAGEMENT COMPANY LLC" "WOODMONT INVESTMENT COUNSEL, LLC"
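To try the extraction step without downloading the roughly 37 MB report, here is a small offline sketch. Only the Firms/Firm/Info nesting and the LegalNm attribute come from the code above; the root element name and the firm names are made up for illustration. It also relies on read_xml() decompressing local paths that end in .gz, so no explicit connection is needed.

```r
library(xml2)

# Hypothetical miniature of the feed: the <Report> root and the firm names are
# invented; only Firms/Firm/Info and LegalNm come from the answer
sample_xml <- '<Report><Firms>
  <Firm><Info LegalNm="EXAMPLE ADVISERS LLC"/></Firm>
  <Firm><Info LegalNm="SAMPLE CAPITAL LP"/></Firm>
</Firms></Report>'

# Write it gzip-compressed so the round trip mirrors report.gz
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, "w")
writeLines(sample_xml, con)
close(con)

# read_xml() uncompresses local paths ending in .gz on its own
doc <- read_xml(gz_path)
info <- xml_find_all(doc, ".//Firms/Firm/Info")

# One row per firm, ready for further analysis
firms <- data.frame(legal_name = xml_attr(info, "LegalNm"),
                    stringsAsFactors = FALSE)
print(firms)
```

Other attributes of Info could be added as extra columns in the same way, once you have checked which ones the real file carries.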
I haven't tried it, but I think you can try this:
javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(
-----> "ctl00$cphMain$lnkStateReport",
"",
false,
"",
-----> "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE",
false,
true))
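Put the items marked with -----> in the obvious places in the url and body arguments to get the other content. Those parameters come from the source of the "State Investment Adviser Report" button link.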
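As a rough sketch of that substitution, the request can be assembled from the two marked values. build_feed_request() and fetch_feed() are hypothetical helper names, not part of httr, and like the suggestion itself the state download is untested against the live site, so the actual call is left commented out.

```r
library(httr)

# Common prefix of the download URL, taken from the request in the answer
base_url <- "http://www.adviserinfo.sec.gov/IAPD/"

# Hypothetical helper: combine the two marked postback values into the
# url and form body used by POST()
build_feed_request <- function(event_target, postback_path) {
  list(url  = paste0(base_url, postback_path),
       body = list(`__EVENTTARGET` = event_target))
}

# Hypothetical helper: send the request and save the gzipped body to dest
fetch_feed <- function(req, dest) {
  res <- POST(url = req$url,
              add_headers(Origin = "http://www.adviserinfo.sec.gov"),
              body = req$body,
              encode = "form")
  stop_for_status(res)  # fail early on an HTTP error status
  writeBin(content(res, as = "raw"), dest)
  invisible(dest)
}

state_req <- build_feed_request(
  "ctl00$cphMain$lnkStateReport",
  "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE"
)

# Not run here, because it needs network access:
# fetch_feed(state_req, "state_report.gz")
```

Keeping the request construction separate from the network call makes it easy to inspect state_req$url and state_req$body before anything is sent.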
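If you really don't want to write the content to a file, you can try a non-exported function from my alpha package to inflate the gzipped raw content directly in R: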
devtools::install_git("https://gitlab.com/hrbrmstr/warc.gz")
raw_report <- warc:::gzuncompress(content(res, as="raw"), 50*1024*1024)
doc <- read_xml(raw_report)
...
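If installing an alpha package is not an option, one base R alternative that may work is to wrap the raw bytes in a gzcon() connection. This is only a sketch under that assumption: gunzip_raw() is a hypothetical helper name, and the sample bytes are generated locally so the example runs without the SEC site.

```r
library(xml2)

# gunzip_raw() is a hypothetical helper; gzcon() and rawConnection() are base R.
# It inflates a gzip-compressed raw vector in memory, reading chunk bytes at a time.
gunzip_raw <- function(x, chunk = 1024 * 1024) {
  con <- gzcon(rawConnection(x))
  on.exit(close(con))
  pieces <- list()
  repeat {
    piece <- readBin(con, what = "raw", n = chunk)
    if (length(piece) == 0) break  # nothing left to inflate
    pieces[[length(pieces) + 1]] <- piece
  }
  do.call(c, pieces)
}

# Local stand-in for content(res, as = "raw"); the XML content is made up
xml_text <- '<Report><Firms><Firm><Info LegalNm="EXAMPLE ADVISERS LLC"/></Firm></Firms></Report>'
tmp <- tempfile(fileext = ".gz")
gz <- gzfile(tmp, "wb")
writeBin(charToRaw(xml_text), gz)
close(gz)
gz_bytes <- readBin(tmp, what = "raw", n = file.size(tmp))

inflated <- gunzip_raw(gz_bytes)
doc <- read_xml(inflated)  # read_xml() also accepts a raw vector
legal_name <- xml_attr(xml_find_first(doc, ".//Info"), "LegalNm")
```

Reading in chunks avoids having to guess the uncompressed size up front, which the 50*1024*1024 argument above appears to do.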