How to use Beautiful Soup to scrape the SEC's EDGAR database and retrieve the desired data

Apologies in advance for the long question. I am new to Python, and I am trying to be as explicit as I can about a fairly specific situation.

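I am trying to routinely pull specific data points out of SEC filings, and I want to automate this rather than manually searching for each company's CIK ID and form filing. So far I have been able to download metadata about all filings the SEC received in a given time period. It looks like this: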

    index  cik      conm                    type  date        path
0   0      1000045  NICHOLAS FINANCIAL INC  10-Q  2019-02-14  edgar/data/1000045/0001193125-19-039489.txt
1   1      1000045  NICHOLAS FINANCIAL INC  4     2019-01-15  edgar/data/1000045/0001357521-19-000001.txt
2   2      1000045  NICHOLAS FINANCIAL INC  4     2019-02-19  edgar/data/1000045/0001357521-19-000002.txt
3   3      1000045  NICHOLAS FINANCIAL INC  4     2019-03-15  edgar/data/1000045/0001357521-19-000003.txt
4   4      1000045  NICHOLAS FINANCIAL INC  8-K   2019-02-01  edgar/data/1000045/0001193125-19-024617.txt

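Despite having all this information, and being able to download these text files and see the underlying data, I cannot parse the data because it is in XBRL format, which is a bit out of my wheelhouse. Instead, I came across this script (courtesy of this site: https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):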

from bs4 import BeautifulSoup
import requests
import sys
# Access page
cik = '0000051143'
type = '10-K'
dateb = '20160101'
# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, type, dateb))
edgar_str = edgar_resp.text
# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']
# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()
# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text
# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']
# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text
# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)

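Simply running this script works exactly the way I want. It returns the stockholders' equity for a given company (IBM in this case), and I can then take that value and write it to an Excel file.

My two-part question is:

1. I took the three relevant columns (CIK, type, and date) from the original metadata table above and wrote them to a list of tuples (I think that is what it is called), which looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, substitute it for the initial part of the script I found, and loop through it efficiently, so that I end up with a list of the desired values for each company, filing, and date?
2. Is there generally a better way to do this? I would have expected some kind of API or Python package for querying the data I am interested in. I know there is some high-level information available for Form 10-K and Form 10-Q, but I am working with Form D, which is somewhat obscure. I just want to make sure I am spending my time effectively on the best solution.

Thanks for your help!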

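You need to define a function, which can be essentially most of the code you posted, and that function should take three keyword arguments (your three values). Then, rather than defining those three values in the code, you pass them in and return the result.

Then you take the list you created and write a simple for loop around it, calling the function you defined with those three values on each pass and doing something with the result.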

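def get_data(value1, value2, value3):
    # Your main code goes here, with the hard-coded values
    # replaced by the arguments above.
    content = None  # placeholder for the value your code extracts
    return content

# companies is your list of (cik, type, date) tuples
for value1, value2, value3 in companies:
    content = get_data(value1, value2, value3)
    # do something with content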
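To make the loop a little more concrete, here is a minimal sketch of collecting one result per tuple while letting a single failed filing be skipped instead of stopping the whole run. The names collect_results, fake_get_data, and delay are hypothetical and not part of the original script; fake_get_data only stands in for the real scraping function so that the call pattern can be shown without network access.

```python
import time


def collect_results(filings, get_data, delay=0.0):
    """Call get_data(cik, form_type, date) for each tuple and gather the results."""
    results = []
    for cik, form_type, date in filings:
        try:
            value = get_data(cik, form_type, date)
        except Exception as exc:
            # One bad filing should not abort the whole run.
            value = None
            print(f"Skipping {cik} {form_type} {date}: {exc}")
        results.append({"cik": cik, "type": form_type, "date": date, "value": value})
        if delay:
            time.sleep(delay)  # optional pause between requests
    return results


# Hypothetical stand-in for the real scraping function.
def fake_get_data(cik, form_type, date):
    if form_type != "D":
        raise ValueError("unsupported form type")
    return f"value for {cik}"


filings = [
    ("1009759", "D", "20190215"),
    ("1009891", "D", "20190206"),
    ("1000045", "4", "20190115"),
]
rows = collect_results(filings, fake_get_data)
```

In real use you would pass your own get_data in place of fake_get_data, and probably a non-zero delay so the requests are spread out. The resulting list of dictionaries can then be written out in whatever form you like.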

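Assuming you have a dataframe sec with correctly named columns for the list of filings above, you first need to extract the relevant information from the dataframe into three lists: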

cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)

Then build base_url with the items inserted and fetch the data:

for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)

And go on from there.
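One detail worth checking when building the URL: the date column in the metadata table is written as 2019-02-14, while the script in the question passes dateb as '20160101'. The sketch below assumes the search page expects the second form and strips the dashes before building the query string. The helper name build_search_url and the constant EDGAR_SEARCH are hypothetical; urlencode comes from the standard library.

```python
from urllib.parse import urlencode

EDGAR_SEARCH = "https://www.sec.gov/cgi-bin/browse-edgar"


def build_search_url(cik, form_type, date):
    """Build the company search URL, normalizing the date to YYYYMMDD."""
    # Keep only the date part (drops any time suffix), then remove dashes.
    dateb = str(date)[:10].replace("-", "")
    query = urlencode({
        "action": "getcompany",
        "CIK": cik,
        "type": form_type,
        "dateb": dateb,
    })
    return f"{EDGAR_SEARCH}?{query}"


url = build_search_url(1000045, "10-Q", "2019-02-14")
print(url)
```

Using urlencode rather than an f-string also takes care of escaping if a form type ever contains a character that is not safe in a URL.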
