Scrape everything between two non-nested tags



Is it possible to scrape everything between two non-nested (sibling) tags?

For example:

<h3>Title 1</h3>
<div class="div">
    <span class="span">span1</span>
    <label class="label">label1</label>
</div>
<div class="div">
    <span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
    <span class="span">span3</span>
    <label class="label">label2</label>
</div>
<div id="div">
    <span id="span">span4</span>
</div>

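So I want to scrape the content that sits under Title 1, up to Title 2.

Right now I have something like this (the problem is that it scrapes everything, because the classes are all the same):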

for i in soup.findAll("div",{"class":"div"}):
    print(i.span.text)

Currently I get:

span1
span2
span3
span4

I want to get:

span1
span2

I don't know whether this is the best solution to the problem, but you can split your text and scrape only the part you need.

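from bs4 import BeautifulSoup

text = """
<h3>Title 1</h3>
<div class="div">
    <span class="span">span1</span>
    <label class="label">label1</label>
</div>
<div class="div">
    <span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
    <span class="span">span3</span>
    <label class="label">label2</label>
</div>
<div id="div">
    <span id="span">span4</span>
</div>
"""
# Parse the full document once so the "Title 2" heading can be located.
soup = BeautifulSoup(text, 'lxml')
# Keep only the markup that precedes the heading text "Title 2".
# (string= is the current name of the older text= argument.)
sub_text = text.split(soup.find('h3', string="Title 2").string)[0]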

This gives:

'\n<h3>Title 1</h3>\n<div class="div">\n    <span class="span">span1</span>\n    <label class="label">label1</label>\n</div>\n<div class="div">\n    <span class="span">span2</span>\n</div>\n<h3>'

After converting that string into a bs4 object, you can scrape everything you need:

scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.findAll("div", class_="div"):
    print(i.span.text)
# prints span1 and span2, one per line
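
The same result can probably be reached without splitting the raw string, by starting at the first heading and walking its following siblings until the next heading appears. The following is only a minimal sketch under some assumptions: the headings and divs are siblings as in the example, the helper name divs_between is made up for illustration, and the built-in "html.parser" is used instead of 'lxml' so that nothing beyond bs4 is required.

```python
from bs4 import BeautifulSoup

html = """
<h3>Title 1</h3>
<div class="div">
    <span class="span">span1</span>
    <label class="label">label1</label>
</div>
<div class="div">
    <span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
    <span class="span">span3</span>
    <label class="label">label2</label>
</div>
<div id="div">
    <span id="span">span4</span>
</div>
"""


def divs_between(soup, start_title, stop_tag="h3"):
    """Hypothetical helper: collect the <div> siblings that follow the
    <h3> whose text is start_title, stopping at the next stop_tag."""
    start = soup.find("h3", string=start_title)
    if start is None:
        raise ValueError(f"no <h3> with text {start_title!r}")
    found = []
    # With no arguments, find_next_siblings() yields only tags,
    # so the whitespace between elements is skipped.
    for sibling in start.find_next_siblings():
        if sibling.name == stop_tag:
            break  # reached the next heading: the section is over
        if sibling.name == "div":
            found.append(sibling)
    return found


soup = BeautifulSoup(html, "html.parser")
section_spans = [div.span.get_text() for div in divs_between(soup, "Title 1")]
print(section_spans)
```

Because the loop stops at the next h3, the divs under Title 2 are never visited, regardless of which classes or ids they carry.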

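One approach:

  1. Locate the span with class="span" inside the second div, then navigate backwards from it with find_all_previous() to collect the divs

  2. The tags come back in reverse document order, so use the reversed() function to restore the original order

  3. Look up the <span> tag inside each div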


from bs4 import BeautifulSoup

html = """
<h3>Title 1</h3>
<div class="div">
    <span class="span">span1</span>
    <label class="label">label1</label>
</div>
<div class="div">
    <span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
    <span class="span">span3</span>
    <label class="label">label2</label>
</div>
<div id="div">
    <span id="span">span4</span>
</div>
"""

soup = BeautifulSoup(html, "lxml")

# Start at the span inside the second <div>, collect every <div> that
# precedes it (its own parent included), and restore document order.
for tag in reversed(
    soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
    print(tag.find("span").text)

Output:

span1
span2
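
A possible variation on the same idea is to anchor the backwards search on the "Title 2" heading itself rather than on the position of the second div, so the lookup does not depend on how many divs the first section contains. This is a rough sketch, not the answer's own code: it assumes no div appears before Title 1 (any such div would be collected as well), and it uses the built-in "html.parser" in place of "lxml".

```python
from bs4 import BeautifulSoup

html = """
<h3>Title 1</h3>
<div class="div">
    <span class="span">span1</span>
    <label class="label">label1</label>
</div>
<div class="div">
    <span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
    <span class="span">span3</span>
    <label class="label">label2</label>
</div>
<div id="div">
    <span id="span">span4</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# The heading that marks the end of the section of interest.
stop = soup.find("h3", string="Title 2")

# find_all_previous() returns matches nearest-first, so reverse them
# to get the divs back in document order.
divs_before = list(reversed(stop.find_all_previous("div")))

texts = [div.span.get_text() for div in divs_before]
print(texts)
```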
