I scraped a list of stocks and appended the items to a list, but because of how my bs4 query works, this also picks up extra HTML elements.
Here is my reproducible code:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

url = 'https://bullishbears.com/russell-2000-stocks-list/'
hdr = {'User-Agent': 'Mozilla/5.0'}
req = Request(url, headers=hdr)
page = urlopen(req)
soup = BeautifulSoup(page, 'html.parser')

divTag = soup.find_all("div", {"class": "thrv_wrapper thrv_text_element"})

stock_list = []
for tag in divTag:
    strongTags = tag.find_all("strong")
    for strong in strongTags:
        # iterate over the children of each <strong> and collect them
        for x in strong:
            stock_list.append(x)
Looking at the resulting list, I'm happy with the format of the stock strings, each followed by a comma (a list of strings). As you can see, though, I'm also getting other HTML elements that I would like to remove: <br/> and <span>.
stock_list =
[<span data-css="tve-u-17078d9d4a6">RUSSELL 2000 STOCKS LIST</span>,
<strong><strong><strong><span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span><span data-css="tve-u-17031e9c4ad">. </span></strong></strong></strong>,
<strong><strong><span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span><span data-css="tve-u-17031e9c4ad">. </span></strong></strong>,
<strong><span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span><span data-css="tve-u-17031e9c4ad">. </span></strong>,
<span data-css="tve-u-17031e9c4ac"> We provide you a list of Russell 2000 stocks and companies below</span>,
<span data-css="tve-u-17031e9c4ad">. </span>,
'List of Russell 2000 Stocks & Updated Chart',
'IWM',
<br/>,
'SPSM',
<br/>,
'VTWO',
'/RTY',
<br/>,
'/M2K',
'AAN',
<br/>,
'AAOI',
<br/>,
'AAON',
<br/>,
'AAT',
<br/>,
'AAWW',
<br/>,
'AAXN',
<br/>,
'ABCB',
<br/>,
'ABEO',
<br/>,
'ABG',
<br/>,
'ABM',
<br/>,
'ABTX',
<br/>,
'AC',
<br/>,
'ACA',
<br/>,
'ACAD',
<br/>,
'ACBI',
<br/>,
'ACCO',
# More to the list but for brevity I removed the rest.
How do I properly fine-tune my bs4 query to get only the stock list?
You need to split the value, because each strong tag contains multiple stocks:
<strong>AAN<br>AAOI<br>AAON<br>AAT<br>....</strong>
Code
# better and easier using a CSS selector
strongTags = soup.select('.tcb-col .thrv_wrapper.thrv_text_element strong')

stock_list = []
for s in strongTags:
    # .decode_contents() returns the innerHTML of the tag
    stocks = s.decode_contents().split('<br/>')
    for stock in stocks:
        stock_list.append(stock)

print(stock_list)
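If you also want to drop the leftover <span> markup and stray whitespace in one pass, here is an alternative sketch (not the code above; it reuses the soup object and CSS selector from earlier) that lets BeautifulSoup extract the text with get_text() and a separator, then filters out empty entries:

# Alternative sketch, reusing `soup` from the question's code above.
# get_text('\n') inserts a newline wherever markup (e.g. <br/>) separated
# the text, so splitting on newlines gives one entry per stock.
stock_list = []
for s in soup.select('.tcb-col .thrv_wrapper.thrv_text_element strong'):
    for text in s.get_text('\n').split('\n'):
        ticker = text.strip()
        if ticker:  # drop empty strings left by adjacent tags
            stock_list.append(ticker)

print(stock_list)

This way only plain strings end up in stock_list, so no <br/> or <span> objects survive, though the header sentences from the page still need to be filtered out separately if you only want tickers.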