I am scraping HTML that looks like this:
<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
I am using this:
soup_object.find_all("div", {"class": "col-16"})
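I only want the divs whose class is exactly "col-16", but it returns every div that has "col-16" among its classes.
How can I select only the divs whose class attribute is just "col-16"?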
Edit
I want this:
<div class="col-16"> ... </div>
<div class="col-16"> ... </div>
But I get this:
<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
Just filter the divs by the number of classes in their class attribute.
For example:
from bs4 import BeautifulSoup

if __name__ == '__main__':
    sample_html = """<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>"""
    divs = BeautifulSoup(sample_html, "html.parser").find_all("div")
    # Keep only divs with exactly one class; .get() avoids a KeyError
    # on divs that have no class attribute at all.
    filtered = [div for div in divs if len(div.get("class", [])) == 1]
    print(filtered)
Output:
[<div class="col-16"> ... </div>, <div class="col-16"> ... </div>]
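Note that counting classes alone would also keep a div whose single class is something other than col-16. A minimal sketch that compares the whole class list instead, assuming the same kind of markup; the helper name has_only_class and the extra "other" div are hypothetical and only there for illustration:

```python
from bs4 import BeautifulSoup


def has_only_class(tag, name):
    """Return True if the tag's class attribute is exactly [name]."""
    # Beautiful Soup parses class as a list of strings;
    # tag.get("class") is None when the attribute is missing.
    return tag.get("class") == [name]


sample_html = """<div class="col-16 text"> ... </div>
<div class="col-16"> ... </div>
<div class="other"> ... </div>"""

divs = BeautifulSoup(sample_html, "html.parser").find_all("div")
filtered = [div for div in divs if has_only_class(div, "col-16")]
print(filtered)
```

With this check, the div with class "other" is dropped even though it also has exactly one class.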
I think these may help:
BeautifulSoup webscraping find_all( ): finding exact match
https://medium.com/@epicshane/using-beutifulsoup4-to-find-class-excact-match-3e263a95e330
I tried the following solution: https://stackoverflow.com/a/22735249/13548379
from bs4 import BeautifulSoup
html_doc = """<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
#print(soup.prettify())
item = soup.find_all(lambda tag: tag.name == 'div' and
                     tag.get('class') == ['col-16'])
for x in item:
    print(x.prettify())
The result is:
<div class="col-16">
...
</div>
<div class="col-16">
...
</div>
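As an alternative to the lambda, an exact-value CSS attribute selector seems to give the same result. This is only a sketch, assuming a reasonably recent Beautiful Soup in which select() is backed by Soup Sieve; the variable name exact is just for illustration:

```python
from bs4 import BeautifulSoup

html_doc = """<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
<div class="col-16 text"> ... </div>
<div class="col-16 image"> ... </div>
<div class="col-16"> ... </div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# [class="col-16"] compares the whole attribute value, unlike the
# .col-16 class selector, which matches any tag carrying that class.
exact = soup.select('div[class="col-16"]')
for tag in exact:
    print(tag)
```

The class selector div.col-16 would return all six divs here, which is the same behaviour the question ran into with find_all.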