Extracting text from HTML document tags



I am trying to extract text from these documents.

I only need the text under the Item 1 header.

What I have tried so far is shown below:

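import html2text
from bs4 import BeautifulSoup as BS

# `response` is the Scrapy response for the filing page
soup = BS(response.text, 'html.parser')
startid = BS(response.css('tr:contains("Item\xa01"), tr:contains("Item 1."), *:contains("ITEM 1")')[0].css('a').get(''), 'html.parser').find('a').attrs
endid = BS(response.css('tr:contains("Item\xa02"), tr:contains("Item 2."), *:contains("ITEM 2")')[0].css('a').get(''), 'html.parser').find('a').attrs

# select() takes a CSS selector and does not filter by an attrs dict, so find() is used here
start = soup.find('a', attrs=startid)
end = soup.find('a', attrs=endid)

html = ''
for tag in start.parent.next_siblings:
    # Stop once the element holding the Item 2 anchor is reached
    if tag == end.parent:
        break
    html += str(tag)

h = html2text.HTML2Text()
h.ignore_links = True
print(h.handle(html))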

I only need the text under the Item 1 section.

If you run:

import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/0000001800/000104746915001377/a2222655z10-k.htm')
print(r.text[1532:(1532 + 571)])

the output is:

To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>\n\n<p>Please declare your traffic by updating your user agent to include company specific information.</p>\n\n\n<p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer" '

The https://www.sec.gov/developer page in turn links to https://www.sec.gov/edgar/sec-api-documentation.
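The message returned by the server asks automated clients to declare themselves through the user agent. A minimal sketch of doing that with the standard library follows; the helper names `build_sec_request` and `fetch` and the contact string are hypothetical, and whether a given string satisfies the SEC's policy is an assumption to check against their developer page.

```python
import urllib.request


def build_sec_request(url, contact):
    """Return a Request whose User-Agent header declares who is calling.

    `contact` is a placeholder you supply yourself, for example
    "Example Corp admin@example.com" (hypothetical value).
    """
    headers = {"User-Agent": contact}
    return urllib.request.Request(url, headers=headers)


def fetch(url, contact, timeout=30):
    """Download `url` with the declared User-Agent and return the body as text."""
    req = build_sec_request(url, contact)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With requests, the equivalent is to pass the same dictionary as `requests.get(url, headers={"User-Agent": contact})`.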

For 0000001800, you should try https://data.sec.gov/submissions/CIK0000001800.json, which contains…

{"cik":"1800","entityType":"operating","sic":"2834
","sicDescription":"Pharmaceutical Preparations","
insiderTransactionForOwnerExists":1,"insiderTransa
ctionForIssuerExists":1,"name":"ABBOTT LABORATORIE
S","tickers":["ABT"],"exchanges":["NYSE"],"ein":"3
60698440","description":"","website":"","investorW
ebsite":"","category":"Large accelerated filer","f
iscalYearEnd":"1231","stateOfIncorporation":"IL","
stateOfIncorporationDescription":"IL","addresses":
{"mailing":{"street1":"100 ABBOTT PARK ROAD","stre
et2":null,"city":"ABBOTT PARK","stateOrCountry":"I
L","zipCode":"60064-3500","stateOrCountryDescripti
on":"IL"},"business":{"street1":"100 ABBOTT PARK R
OAD","street2":null,"city":"ABBOTT PARK","stateOrC
ountry":"IL","zipCode":"60064-3500","stateOrCountr
yDescription":"IL"}},"phone":"2246676100","flags":
"","formerNames":[],"filings":{"recent":{"accessio
nNumber":["0001415889-21-004019","0001415889-21-00
4018","0001415889-21-003917","0001415889-21-003804
","0001104659-21-100055","0001415889-21-003773","0
001415889-21-003748","0001104659-21-094680","00014
15889-21-003516","0001415889-21-003514","000141588
9-21-003513","0001415889-21-003512","0001415889-21
-003509","0001415889-21-003503","0001415889-21-003
428","0001415889-21-003425","0001415889-21-003423"
,"0001415889-21-003418","0001104659-21-086325","00
01415889-21-002958","0001415889-21-002831","000141
5889-21-002830","0001104659-21-0763........
