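I am trying to get the .xml data from SEC filings. It is in the second table. But if I land on a page that has no .xml, I want the HTML version, which is in the first and only table. Can someone help me understand how to iterate over or skip the first table when there are two, and how to get the first a['href'] from the first table when there is only one?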
from urllib.request import urlopen  # Python 3 location of urllib2's urlopen
import requests
from bs4 import BeautifulSoup

linklist = ['https://www.sec.gov/Archives/edgar/data/1070789/000149315217011092/0001493152-17-011092-index.htm',
            'https://www.sec.gov/Archives/edgar/data/1592603/000139160917000254/0001391609-17-000254-index.htm']
for l in linklist:
    html = urlopen(l)
    soup = BeautifulSoup(html.read().decode('latin-1', 'ignore'), "lxml")
    table = soup.findAll(class_='tableFile')  # works for getting all .htm links
    # attempt: use the second table if there is one, otherwise the first
    if len(table) > 1:
        url = table[1].a["href"]
    else:
        url = table[0].a["href"]
In both cases you always need the information from the last table, so you can index the list of tables with -1:
import requests
from bs4 import BeautifulSoup

urls = ['https://www.sec.gov/Archives/edgar/data/1070789/000149315217011092/0001493152-17-011092-index.htm',
        'https://www.sec.gov/Archives/edgar/data/1592603/000139160917000254/0001391609-17-000254-index.htm']
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    tables = soup.findAll('table', class_='tableFile')
    # assume xml table always comes after html one
    table = tables[-1]
    for a in table.findAll('a'):
        print(a['href'])  # you may filter out txt or xsd here
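If you want a single link per page rather than every link in the last table, the selection step could be pulled out into a small helper. The following is only a sketch: `choose_href` is a hypothetical name, the sample hrefs are made up rather than taken from the filings, and it assumes that "the .xml file" means the first link in the last table whose name ends in `.xml`.

```python
def choose_href(hrefs_by_table):
    """Pick one link from per-table href lists (hypothetical helper).

    Prefers the first .xml link in the last table; otherwise falls back
    to the first link of the first table. Returns None if there are no links.
    """
    # Drop tables without any links so the indexing below is safe.
    tables = [hrefs for hrefs in hrefs_by_table if hrefs]
    if not tables:
        return None
    # Look for an .xml link in the last table (this skips .xsd, .txt, ...).
    for href in tables[-1]:
        if href.lower().endswith(".xml"):
            return href
    # No .xml found: fall back to the first link of the first table.
    return tables[0][0]


# Made-up sample hrefs, not taken from the real filings.
with_xml = [["/doc/form10q.htm"], ["/doc/abc.xsd", "/doc/abc.xml"]]
html_only = [["/doc/form10q.htm", "/doc/ex31.htm"]]

print(choose_href(with_xml))
print(choose_href(html_only))
```

Inside the loop above, the input could be built with something like `[[a['href'] for a in t.findAll('a', href=True)] for t in tables]`. Keeping the choice separate from the scraping makes the fallback rule easy to check without hitting sec.gov.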