Beautiful Soup web scraping with CSS selectors



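I'm trying to scrape 11 fields from a standard SEC filing on EDGAR (SEC.gov) and return them in a simple dictionary. When I run the code below, 7 of the fields work fine, but 4 of them (named "Director", "Officer", "Person" and "Ticker" in the code) return an empty list, even though the page shows actual text in those fields, and I can't figure out how to fix it. I got the CSS selectors for these fields using DevTools in Chrome, looking at the Elements tab for the page I'm trying to scrape. One thing to note is that the CSS selectors for these 4 fields are longer than those for the fields that work (i.e., the "tree" describing their location on the page is longer), so I suspect I have a syntax error in the selectors that point to these 4 fields.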

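By the way, I'm new to Python. Early on in researching this, I learned that with Beautiful Soup, CSS selector references must use "nth-of-type" rather than "nth-child", so I have already made those changes to the code.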

I can't see why these 4 fields won't return the data shown on the form when the other 7 work fine. Any help or guidance would be greatly appreciated!

Note: I'm using Python 3.

import bs4, requests, pprint

def getFormData(form4url):
    res = requests.get(form4url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')
    # scrape the data from each field of the SEC Form 4 document. Each field is identified by its
    # CSS selector from the web page's html (viewed using DevTools -> Elements tab in Chrome)
    person = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > table:nth-of-type(2) > tbody > tr > td > a')
    ticker = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span.FormData')
    director = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    officer = soup.select('body > table:nth-of-type(2) > tbody > tr:nth-of-type(1) > td:nth-of-type(3) > table > tbody > tr:nth-of-type(2) > td:nth-of-type(1)')
    security = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(1) > span')
    date = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(2) > span')
    tCode = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(4)')
    qtyTrans = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(6) > span.FormData')
    transType = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(7) > span')
    price = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(8) > span.FormData')
    qtyAfter = soup.select('body > table:nth-of-type(3) > tbody > tr:nth-of-type(1) > td:nth-of-type(9) > span')
    return {'Person':person,'Ticker':ticker,'Director':director,'Officer':officer,
            'Security':security,'Date':date, 'Trans Code':tCode, 'Quantity':qtyTrans,
            'Trans Type':transType,'Price':price,'Qty After':qtyAfter}

# this is the website to scrape
userLink = 'https://www.sec.gov/Archives/edgar/data/1539638/000120919118040737/xslF345X03/doc4.xml'
dataDict = getFormData(userLink)

# following just cleans up values in dict by removing html from scraped fields (lists of
# strings), leaving only the visible text
for key,value in dataDict.items():
    if len(value) > 0:
        dataDict[key] = dataDict[key][0].text.strip()

pprint.pprint(dataDict)

The correct CSS selectors for Person, Ticker, Director, and Officer are:

person: "table:nth-of-type(2) > tr > td > table"
ticker: "table:nth-of-type(2) > tr > td:nth-of-type(2) > span:nth-of-type(2)"
director: "table:nth-of-type(2) > tr > td:nth-of-type(3) > table > tr > td"
officer: "table:nth-of-type(2) > tr > td:nth-of-type(3) > table > tr:nth-of-type(2) > td"
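
Note that none of these selectors contains a `tbody` step. A likely reason the original selectors came back empty is that Chrome inserts `tbody` elements into the DOM shown in DevTools when the source markup omits them, while `html.parser` builds the tree from the markup as written. The snippet below is a minimal sketch of that effect. `SAMPLE_HTML` is made-up markup, not the actual SEC page, and the ticker value in it is invented.

```python
import bs4

# Hypothetical, simplified markup (NOT the real Form 4 page): the second
# top-level table has no <tbody> in its source.
SAMPLE_HTML = """
<html><body>
<table><tr><td>header</td></tr></table>
<table>
  <tr>
    <td><table><tr><td><a href="#">Doe John</a></td></tr></table></td>
    <td><span>Issuer Name</span> [ <span class="FormData">XYZ</span> ]</td>
  </tr>
</table>
</body></html>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# Selector as copied from DevTools: includes the <tbody> the browser added.
with_tbody = soup.select(
    "body > table:nth-of-type(2) > tbody > tr > td:nth-of-type(2) > span.FormData"
)

# Same path without the <tbody> step, matching the markup as written.
without_tbody = soup.select(
    "body > table:nth-of-type(2) > tr > td:nth-of-type(2) > span.FormData"
)

print(with_tbody)                              # empty list: no <tbody> in the parsed tree
print([el.text.strip() for el in without_tbody])
```

If a table in the real page does have an explicit `tbody` in its source, selectors that include it will keep working, which may be why the other 7 fields were unaffected.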

Here is a demo using Node.js and x-ray with the sample link you provided: https://codesandbox.io/s/j489wlyzmw

The demo doesn't return any value for Officer because Officer isn't set on this form.
