BeautifulSoup: get a range of divs



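I have just discovered how to process web pages in Python with BeautifulSoup. I have a list of divs and want to get only those within a specific range. The range is delimited by two divs that each have an h2 child. How can I do this? Thanks for your help!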

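Edit: I have added an actual representation of my HTML below, replacing the earlier "simplified" version that was missing tags. The new code shows a root div with the class foo-bar-details. Nested inside it are 9 div tags. Two of them have a nested h2 tag; the other div tags contain deeply nested img elements. What I need is every img element from the divs that lie between the two divs containing h2 elements. Applied to the HTML below, the expected result would be: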

<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">

Here is the HTML code:

<div class="foo-bar-details">
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong> 
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary0.pdf" class="link" title="test"><span class="icon-help"></span></a> 
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-39826.html"><img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 "></a> 
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>JHFDFD </strong> 
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary2.pdf" class="link" title="test"><span class="icon-help"></span></a> 
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-223234.html"><img src="../../images/223234_thumb.JPG" alt="Image 223234" title="Image 223234 "></a> 
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>sdfsdf </strong> 
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary1.pdf" class="link" title="test"><span class="icon-help"></span></a> 
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-223823.html"><img src="../../images/223823_thumb.JPG" alt="Image 223823" title="Image 223823 "></a> 
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo 
</h2>
<ul class="list-inline margin-0">
<li> <a href="#foo-feat-4-1">Foo feature</a> </li>
... 
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">
<div class="row">
<div class="col-se-6 element-info">
<div class="col-se-12">
<div class="row">
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-123456.html"><img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456"></a> 
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="sec-feat-4-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Foo strin: </strong> 
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Barbar</strong><a href="../test.pdf" class="link" title="test"><span class="icon-help"></span></a> 
</p>
</div>
</div>
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Mine: </strong> 
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
TEST<a href="../link.pdf" class="my-link" title="title"><span class="icon-help"></span></a> 
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-67890.html"><img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890"></a> 
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar 
</h2>
<ul class="list-inline margin-0">
<li> <a href="#foo-feat-5-1">Bar feature</a> </li>
... 
</ul>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong> 
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary0.pdf" class="link" title="test"><span class="icon-help"></span></a> 
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-39826.html"><img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 "></a> 
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong> 
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><a href="../linkglossary0.pdf" class="link" title="test"><span class="icon-help"></span></a> 
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<a href="image-209876.html"><img src="../../images/209876_thumb.JPG" alt="Image 209876" title="Image 209876 "></a> 
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
</div>

Here is a solution using lxml.html:

We extract all the divs from the first to the last div that contains an h2 tag (both header divs included).

import lxml.html

# HTML file saved as "file.html"
file_name = "file.html"
with open(file_name, 'r') as f:
    tree = lxml.html.fromstring(f.read())
# all_div = tree.findall('div')
# direct child divs of the root div with class foo-bar-details
all_div = tree.find_class('foo-bar-details')[0].findall('div')
start, stop = None, None
for k, div in enumerate(all_div):
    if div.findall('h2') and start is None:
        print("Range starts at %d" % k)
        start = k
        continue
    if div.findall('h2') and start is not None:
        print("Range stops at %d" % k)
        stop = k + 1  # slice end is exclusive, so add one to include div k
        continue
# div_list = all_div[start:stop]
img_list = [_.xpath('.//img') for _ in all_div[start:stop]]
print(img_list)
# [[], [<Element img at 0x20b58d73f40>], [<Element img at 0x20b58d73f90>], []]
# Or
img_list = [_.xpath('.//img/@src') for _ in all_div[start:stop]]
print(img_list)
# [[], ['../../images/123456_thumb.jpg'], ['../../images/67890_thumb.JPG'], []]
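Since the question asks about BeautifulSoup, the same index-range idea can also be sketched with bs4 alone. This is only a minimal sketch, assuming the bs4 package is installed; the shortened HTML string, the ids a to d and the variable names are illustrative stand-ins, not part of the original post.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative stand-in for the question's HTML.
html = """
<div class="foo-bar-details">
  <div id="a"><div class="row"><img src="before.jpg"></div></div>
  <div id="elem-4"><h2>Foo</h2></div>
  <div id="b"><div class="row"><a href="#"><img src="123456_thumb.jpg"></a></div></div>
  <div id="c"><div class="row"><a href="#"><img src="67890_thumb.JPG"></a></div></div>
  <div id="elem-5"><h2>Bar</h2></div>
  <div id="d"><div class="row"><img src="after.jpg"></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
root = soup.find("div", class_="foo-bar-details")

# Direct child divs only; recursive=False skips the nested divs.
children = root.find_all("div", recursive=False)

# Indexes of the child divs that have an h2 as a direct child.
header_idx = [i for i, d in enumerate(children) if d.find("h2", recursive=False)]

srcs = []
if len(header_idx) >= 2:
    first, last = header_idx[0], header_idx[-1]
    # Slice strictly between the two header divs, then collect nested images.
    for div in children[first + 1:last]:
        srcs.extend(img["src"] for img in div.find_all("img"))

print(srcs)
```

Slicing from first + 1 to last leaves out the two header divs themselves, so no empty entries appear in the result.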

Another solution, using SimplifiedDoc:

from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<div class="foo-bar-details">
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo 
</h2>
<ul class="list-inline margin-0">
<li> <a href="#foo-feat-4-1">Foo feature</a> </li>
... 
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">Test 1</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-1">Test 2</div>
<div class="padding-y-10 padding-x-40 " id="foo-feat-4-2">Test 3</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-3">Test 4</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar 
</h2>
<ul class="list-inline margin-0">
<li> <a href="#foo-feat-5-1">Bar feature</a> </li>
... 
</ul>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.select('div.foo-bar-details').divs.contains('<h2')
print([div.id for div in divs])
divs = doc.select('div.foo-bar-details').divs.notContains('<h2')
print([div.id for div in divs])

Result:

['elem-4', 'elem-5']
['info-panel-header', 'foo-feat-4-1', 'foo-feat-4-2', 'foo-feat-4-3']

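The SimplifiedDoc library does not depend on third-party libraries, is lightweight and fast, and is well suited to beginners. More examples are available here.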

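If I understand correctly, you want to find the <img> tags together with the <h2> each image belongs to.

This example (the txt variable contains the HTML snippet from your question):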

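from pprint import pprint

from bs4 import BeautifulSoup

# txt holds the HTML snippet from the question
soup = BeautifulSoup(txt, 'html.parser')
out = {}
# every img inside a div that follows a sibling div containing an h2
for img in soup.select('div:has(h2) ~ div img'):
    # group each image src under the text of the nearest preceding h2
    out.setdefault(img.find_previous('h2').get_text(strip=True), []).append(img['src'])
pprint(out)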

Prints:

{'Bar': ['../../images/39826_thumb.JPG', '../../images/209876_thumb.JPG'],
 'Foo': ['../../images/123456_thumb.jpg', '../../images/67890_thumb.JPG']}
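The 'div:has(h2) ~ div img' selector also picks up the images that follow the second header ("Bar"). If only the divs between the two headers are wanted, as in the question's expected result, one option is to walk the siblings of the first header div and stop at the next one. This is a hedged sketch, assuming bs4 with its bundled soupsieve selector support; the shortened HTML and the names first_header and between are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative stand-in for the question's HTML.
html = """
<div class="foo-bar-details">
  <div><div class="row"><img src="before.jpg"></div></div>
  <div id="elem-4"><h2>Foo</h2></div>
  <div><div class="row"><a href="#"><img src="123456_thumb.jpg"></a></div></div>
  <div><div class="row"><a href="#"><img src="67890_thumb.JPG"></a></div></div>
  <div id="elem-5"><h2>Bar</h2></div>
  <div><div class="row"><img src="after.jpg"></div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# First direct child of the root div that has an h2 as its own direct child.
first_header = soup.select_one("div.foo-bar-details > div:has(> h2)")

between = []
if first_header is not None:
    # Following sibling divs, in document order.
    for sib in first_header.find_next_siblings("div"):
        if sib.find("h2", recursive=False):
            break  # reached the next header div: end of the range
        between.extend(img["src"] for img in sib.find_all("img"))

print(between)
```

Stopping at the first sibling that has its own h2 keeps the result to the section under the first header.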
