Importing SharePoint site data into R



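I came across a public data set that I can't figure out how to get directly into R. Normally I use the following R code to pull data from the web: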

temp <- tempfile()
download.file("http://www.webaddress.com",temp)
data <- read.csv(unz(temp, "name_of_file"))
unlink(temp)

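This SEC website, however, has me a bit stumped about how to pull it directly into R. One reason is that when you right-click the link, instead of a URL you get the following code: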

javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions("ctl00$cphMain$lnkSECReport", "", false, "", "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", false, true))

URL: http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx

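Is there a way to get this data directly into R? So far I have been downloading the file, opening it with 7-zip, saving it to Excel, and then importing it into R.

Updated code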

library(httr)
library(xml2)
library(magrittr)  # provides %>%, which neither httr nor xml2 exports
res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
            httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"), 
            body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"), 
            encode = "form")
writeBin(content(res, as="raw"), "report.gz")
gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)

xml_find_all(doc, ".//Firms/Firm/Info") %>% 
  xml_attr("LegalNm") %>% 
  head(10)

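This is a truly horrible, awful, no-good SharePoint site, the kind that is showing up like mad in almost every government e-initiative around the globe and making data less and less accessible.

Having said that, I'm surprised this actually worked: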

library(httr)
library(xml2)
library(magrittr)  # provides %>%, used further down
res <- POST(url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
           httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"), 
           body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport"), 
           encode = "form")

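I used curlconverter to extract the web call after cancelling the direct download and looking at said call in Developer Tools (which has to be open before the download starts).

The original computed httr request function looked like this: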

httr::VERB(verb = "POST", url = "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx?FeedPK=37264&FeedType=IA_FIRM_SEC", 
           httr::add_headers(Origin = "http://www.adviserinfo.sec.gov", 
                             `Accept-Encoding` = "gzip, deflate", `Accept-Language` = "en-US,en;q=0.8", 
                             `Upgrade-Insecure-Requests` = "1", `User-Agent` = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36", 
                             Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8", 
                             `Cache-Control` = "max-age=0", Referer = "http://www.adviserinfo.sec.gov/IAPD/InvestmentAdviserData.aspx", 
                             Connection = "keep-alive", DNT = "1"), httr::set_cookies(ASP.NET_SessionId = "vp5bt2nrl5m3l4tqq4mkbfrz"), 
           body = list(`__EVENTTARGET` = "ctl00$cphMain$lnkSECReport", 
                       `__EVENTARGUMENT` = "", `__VIEWSTATE` = "/wEPDwUIOTg2OTY2NjYPZBYCZg9kFgQCAQ8WAh4EVGV4dAUeSUFQRCAtIEludmVzdG1lbnQgQWR2aXNlciBEYXRhZAIDD2QWAgIFD2QWEAIDDw8WAh4LUG9zdEJhY2tVcmwFUn4vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMWAh4Hb25jbGljawWvAWdhKCdzZW5kJywgJ3BhZ2V2aWV3JywgeydwYWdlJzogJ34vSUFQRC9Db250ZW50L0J1bGtGZWVkL0NvbXBpbGF0aW9uRG93bmxvYWQuYXNweD9GZWVkUEs9MzcyNjQmRmVlZFR5cGU9SUFfRklSTV9TRUMnLCAndGl0bGUnOiAnSUFQRCAtIFNFQyBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J0IChHWklQKSd9KTtkAgcPZBYCZg8PFgIfAAVKUmVwb3J0IGFzIG9mOiA8Yj5TZXB0ZW1iZXIgNiwgMjAxNjwvYj4gPGJyLz5BcHByb3hpbWF0ZSBmaWxlIHNpemU6IDM3IE1CICBkZAINDw8WAh8BBVR+L0lBUEQvQ29udGVudC9CdWxrRmVlZC9Db21waWxhdGlvbkRvd25sb2FkLmFzcHg/RmVlZFBLPTM3MjY1JkZlZWRUeXBlPUlBX0ZJUk1fU1RBVEUWAh8CBbMBZ2EoJ3NlbmQnLCAncGFnZXZpZXcnLCB7J3BhZ2UnOiAnfi9JQVBEL0NvbnRlbnQvQnVsa0ZlZWQvQ29tcGlsYXRpb25Eb3dubG9hZC5hc3B4P0ZlZWRQSz0zNzI2NSZGZWVkVHlwZT1JQV9GSVJNX1NUQVRFJywgJ3RpdGxlJzogJ0lBUEQgLSBTdGF0ZSBJbnZlc3RtZW50IEFkdmlzZXIgUmVwb3J... <truncated>
                       `__VIEWSTATEGENERATOR` = "C7F140E8", `__PREVIOUSPAGE` = "_n_AIWFFdFo0uFQroVexEbLyjk41mQczgUv0yM_5WfsMAs5Mr4_W9OsfhauW1md49E6AtLMLKvwsM3efjdsFxSQVs8m60rXjM2G3a38s-vs9jeifY7Z97KwNciQDnS3E0", 
                       `__EVENTVALIDATION` = "/wEdAAQgBK7oCoSH1SyM/nnv4+7OQ6BBh5UglL0V4PbvTmfHL5ETgQBTBoVSpnQmZd0nxKz/1ubqHHzGDP0ztOLUKJjXWi90IlgKV4uaEBSHcRvGBiO1/K20oSh88Xa2qq9BBCI="), 
                       encode = "form")

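In my experience, these truly evil SharePoint sites need all sorts of "view state" information, but I tried trimming down and converting the call, and it works (at least it did within 2 minutes of my first visit to the site).

You're not out of the woods yet, since: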

res$headers$`content-type`
## "application/x-gzip; charset=utf-8"

and that is the case even if you add:

`Accept-Encoding` = "gzip, deflate"

to the add_headers() call.

So, since memDecompress() is an utterly useless function, you need to do:

writeBin(content(res, as="raw"), "report.gz")

to write the gzip'd content out to a file.

Now we can work with it:

gzf <- gzfile("report.gz")
doc <- read_xml(gzf)
close(gzf)

xml_find_all(doc, ".//Firms/Firm/Info") %>% 
  xml_attr("LegalNm") %>% 
  head(10)
## [1] "LAUNCH ANGELS MANAGEMENT COMPANY, LLC"       "JACOBSEN CAPITAL MANAGEMENT, LLC"           
## [3] "CORESTATES CAPITAL ADVISORS, LLC"            "MINNEAPOLIS PORTFOLIO MANAGEMENT GROUP, LLC"
## [5] "SHANNON RIVER FUND MANAGEMENT, LLC"          "AAC BENELUX HOLDING BV"                     
## [7] "WILLINK ASSET MANAGEMENT LLC"                "SPIVAK ASSET MANAGEMENT, LLC"               
## [9] "ANNALY MANAGEMENT COMPANY LLC"               "WOODMONT INVESTMENT COUNSEL, LLC"           

I haven't tried it, but I suspect you can take:

javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(
  -----> "ctl00$cphMain$lnkStateReport", 
  "", 
  false, 
  "", 
  -----> "Content/BulkFeed/CompilationDownload.aspx?FeedPK=37265&FeedType=IA_FIRM_STATE", 
  false, 
  true))

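and put the items marked with -----> in the obvious places in the url and body arguments to get the other file. Those parameters come from the link source of the "State Investment Adviser Report" button.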
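To make that substitution less error-prone, here is a minimal sketch of a helper that assembles the url and body from the three values visible in a link's javascript: source. The name iapd_request() is hypothetical (it is not part of httr or any package), and like the suggestion above, the state-report request itself is untested against the live site.

```r
# Hypothetical helper: build the POST url and form body for one of the
# bulk-feed links from the values found in its javascript: source.
iapd_request <- function(event_target, feed_pk, feed_type) {
  base <- "http://www.adviserinfo.sec.gov/IAPD/Content/BulkFeed/CompilationDownload.aspx"
  list(
    url  = sprintf("%s?FeedPK=%d&FeedType=%s", base, as.integer(feed_pk), feed_type),
    body = list(`__EVENTTARGET` = event_target)
  )
}

# Values copied from the "State Investment Adviser Report" link source.
req <- iapd_request("ctl00$cphMain$lnkStateReport", 37265L, "IA_FIRM_STATE")

# The request itself would then mirror the SEC-report call (assumption,
# not verified against the site):
# res <- httr::POST(url = req$url,
#                   httr::add_headers(Origin = "http://www.adviserinfo.sec.gov"),
#                   body = req$body,
#                   encode = "form")
```

Keeping the request construction separate from the POST means you can inspect req$url and req$body before hitting the server.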

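If you really don't want to write the content to a file, you can try a non-exported function from an alpha package of mine to inflate the gzip'd raw content directly in R: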

devtools::install_git("https://gitlab.com/hrbrmstr/warc.gz")
raw_report <- warc:::gzuncompress(content(res, as="raw"), 50*1024*1024)
doc <- read_xml(raw_report)
...
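A base-R alternative that may also avoid the temporary file is to wrap the raw bytes in a rawConnection() and let gzcon() inflate them on read. The sketch below is not from the package above; gunzip_raw() is a hypothetical helper name, and the tiny XML payload only stands in for content(res, as="raw") so the example runs without network access.

```r
# Hypothetical helper: inflate gzip bytes held in a raw vector, in memory.
gunzip_raw <- function(bytes) {
  con <- gzcon(rawConnection(bytes))   # gzcon() decompresses as we read
  on.exit(close(con))                  # also closes the wrapped raw connection
  paste(readLines(con, warn = FALSE), collapse = "\n")
}

# Build a small gzip payload as a stand-in for content(res, as = "raw").
tmp <- tempfile(fileext = ".gz")
gz <- gzfile(tmp, "w")
writeLines('<Firms><Firm><Info LegalNm="EXAMPLE ADVISERS, LLC"/></Firm></Firms>', gz)
close(gz)
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

xml_text <- gunzip_raw(bytes)

# With the real response this would presumably become:
# doc <- xml2::read_xml(gunzip_raw(httr::content(res, as = "raw")))
```

For a report of roughly 37 MB compressed, reading everything into one string costs memory, so writing to a file as shown earlier may still be the more practical route.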
