How do I scrape text from <span> elements that have no unique class identifier?



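I'm new to scraping, so please bear with me. I have this HTML and I want to extract the property type (e.g. "Apartment"), the number of beds (e.g. 2), and the location (e.g. "Birmingham"). I want to save each of them in its own list. The problem is that the spans have no unique class identifier.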

<div class="extra">
<span class="tablet-visible">
<span class="item"><label><i class="ouricon classified"></i><b></b></label>
<span>For Sale</span></span>
</span>
<span class="tablet-visible">
<span class="item"><label><i class="ouricon house"></i><b></b></label>
<span>Apartment</span></span>
</span>
<span class="">
<span class="item"><label><i class="ouricon bed"></i><b></b></label>
<span>2</span>
</span>
</span>
<span class="">
<span class="item"><label><i class="ouricon locationpin"></i><b></b></label>
<span>Birmingham</span>
</span>
</span> 
</div>

I tried this code, but of course it prints all the text inside class=extra, including "For Sale", which is not what I want.

import requests
from bs4 import BeautifulSoup

results = requests.get(url)
soup = BeautifulSoup(results.text, "html.parser")
desc_div = soup.find_all('div', attrs={"data-itemid": True})
for property in desc_div:
    extra = property.find('div', class_='extra')
    print(extra.text.strip())

Any help would be greatly appreciated.

Since "For Sale" sits in the same tag and class as the values you want, just filter it out.

from bs4 import BeautifulSoup
html = """
<div class="extra">
<span class="tablet-visible">
<span class="item"><label><i class="ouricon classified"></i><b></b></label>
<span>For Sale</span></span>
</span>
<span class="tablet-visible">
<span class="item"><label><i class="ouricon house"></i><b></b></label>
<span>Apartment</span></span>
</span>
<span class="">
<span class="item"><label><i class="ouricon bed"></i><b></b></label>
<span>2</span>
</span>
</span>
<span class="">
<span class="item"><label><i class="ouricon locationpin"></i><b></b></label>
<span>Birmingham</span>
</span>
</span> 
</div>
"""
# Collect every <span class="item">, then drop the "For Sale" entry by its text
items = BeautifulSoup(html, "html.parser").find_all("span", {"class": "item"})
print([i.text.strip() for i in items if i.text.strip() != "For Sale"])

Output:

['Apartment', '2', 'Birmingham']
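
Filtering on the text "For Sale" works for this sample, but it would let through any other status text (a rental listing, for instance). One alternative, sketched here under assumptions, is to key on the icon class inside each item (house, bed, locationpin), since that is the only thing in the markup that tells the fields apart. The ICON_TO_FIELD mapping and the extract_fields helper are hypothetical names introduced for illustration, and the sketch assumes every listing uses the same icon classes as the sample.

```python
from bs4 import BeautifulSoup

html = """
<div class="extra">
<span class="tablet-visible">
<span class="item"><label><i class="ouricon classified"></i><b></b></label>
<span>For Sale</span></span>
</span>
<span class="tablet-visible">
<span class="item"><label><i class="ouricon house"></i><b></b></label>
<span>Apartment</span></span>
</span>
<span class="">
<span class="item"><label><i class="ouricon bed"></i><b></b></label>
<span>2</span>
</span>
</span>
<span class="">
<span class="item"><label><i class="ouricon locationpin"></i><b></b></label>
<span>Birmingham</span>
</span>
</span>
</div>
"""

# Hypothetical mapping from icon class to the field it labels
ICON_TO_FIELD = {"house": "type", "bed": "beds", "locationpin": "location"}


def extract_fields(extra):
    """Return {field: text} for the items whose icon class is in ICON_TO_FIELD."""
    fields = {}
    for item in extra.find_all("span", class_="item"):
        icon = item.find("i", class_="ouricon")
        value = item.find("span")  # first nested <span> holds the text
        if icon is None or value is None:
            continue
        for cls in icon.get("class", []):
            if cls in ICON_TO_FIELD:
                fields[ICON_TO_FIELD[cls]] = value.get_text(strip=True)
    return fields


soup = BeautifulSoup(html, "html.parser")
types, beds, locations = [], [], []
for extra in soup.find_all("div", class_="extra"):
    fields = extract_fields(extra)
    # .get() leaves None in the list when a listing lacks that field
    types.append(fields.get("type"))
    beds.append(fields.get("beds"))
    locations.append(fields.get("location"))

print(types, beds, locations)
```

In the loop from the question, the same helper could be called on each extra found inside a data-itemid div, which keeps the three lists aligned by listing.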
