Is it possible to scrape everything between two un-nested tags?
For example:
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
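So I want to scrape the content that sits under Title 1, up to Title 2.
Right now I have something like this (the problem is that it scrapes everything, because the classes are all the same):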
for i in soup.findAll("div",{"class":"div"}):
    print(i.span.text)
Right now I get:
span1
span2
span3
span4
I want to get:
span1
span2
I don't know whether this is the best solution to the problem, but you can split your text and scrape only the part you need.
text = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
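from bs4 import BeautifulSoup
# Parse the full text once so the "Title 2" heading can be located
soup = BeautifulSoup(text, 'lxml')
# Keep only the part of the raw text that comes before the "Title 2" string
sub_text = text.split(soup.find('h3', string="Title 2").string)[0]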
This gives:
'\n<h3>Title 1</h3>\n<div class="div">\n <span class="span">span1</span>\n <label class="label">label1</label>\n</div>\n<div class="div">\n <span class="span">span2</span>\n</div>\n<h3>'
After converting that string into a bs4 object, you can scrape everything you need:
scrape_me = BeautifulSoup(sub_text, 'lxml')
for i in scrape_me.findAll("div", class_="div"):
    print(i.span.text)
# -> span1 span2
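One way to do it:
- Find the second class="span" element (the one inside the second <div>), then navigate backwards over the <div> tags with find_all_previous().
- The tags come back in reverse order, so use the reversed() function.
- Find the <span> tag inside each <div>.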
from bs4 import BeautifulSoup
html = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for tag in reversed(
    soup.select_one("div:nth-of-type(2) span.span").find_all_previous("div")
):
    print(tag.find("span").text)
Output:
span1
span2
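Both approaches above depend on knowing the text or position of the second heading. As a possible alternative (not taken from either answer), here is a minimal sketch that walks the siblings following a heading and stops at the next one. The helper name `spans_between` and its parameters are hypothetical. The sketch assumes the headings are properly closed with `</h3>` and that the `<div>` blocks are siblings of the headings, as in the sample above. It uses the built-in "html.parser" so no extra parser is required.

```python
from bs4 import BeautifulSoup

html = """
<h3>Title 1</h3>
<div class="div">
<span class="span">span1</span>
<label class="label">label1</label>
</div>
<div class="div">
<span class="span">span2</span>
</div>
<h3>Title 2</h3>
<div class="div">
<span class="span">span3</span>
<label class="label">label2</label>
</div>
<div id="div">
<span id="span">span4</span>
</div>
"""


def spans_between(soup, start_title, stop_tag="h3"):
    """Collect <span> text from the siblings after a heading, up to the next heading.

    Hypothetical helper for illustration; assumes a flat sibling structure.
    """
    start = soup.find("h3", string=start_title)
    results = []
    if start is None:
        return results
    # Passing True as the name matches every tag, so whitespace-only
    # strings between the tags are skipped.
    for sibling in start.find_next_siblings(True):
        if sibling.name == stop_tag:
            break  # reached the next heading, stop collecting
        span = sibling.find("span")
        if span is not None:
            results.append(span.text)
    return results


soup = BeautifulSoup(html, "html.parser")
print(spans_between(soup, "Title 1"))
```

With the sample markup this should print the two spans under Title 1. Calling it with "Title 2" collects the remaining spans, because the loop simply runs until the siblings are exhausted.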