Isolating the src attribute from a BeautifulSoup result in Python

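I am using Python 3 and BeautifulSoup to grab a specific div from a web page. My end goal is to get the URL in the img src inside this div, so that I can pass it to pytesseract and extract the text from the image.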

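The img has no class or unique identifier, so I am not sure how to pick out this image reliably with BeautifulSoup. There are several other images on the page, and their order changes every day. So instead I grab the whole div that surrounds the image. The div's attributes do not change and are unique, so my code looks like this: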

weather_today = soup.find("div", {"id": "weather_today_content"})

As a result, my script currently returns the following:

<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>

Now I just need to work out how to pull the src into a string, so that I can download the image and pass it to pytesseract to extract more information with OCR.

I am not familiar with regex, but I have been told it is the best way to do this. Any assistance would be greatly appreciated. Thank you.

Find the img element inside the div you have already located, then read its src attribute.

from bs4 import BeautifulSoup
html = """
<html><body>
<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>
</body></html>
"""
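soup = BeautifulSoup(html, 'html.parser')
# Locate the div by its unique id
weather_today = soup.find("div", {"id": "weather_today_content"})
# Find the first img inside that div; tag attributes are read like dictionary keys
print(weather_today.find('img')['src'])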

Output:

/database/img/weather_today.jpg?ver=2018-08-01
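
The src printed above is a site-relative path, so it most likely has to be resolved against the address of the page before the image can be downloaded. A minimal sketch using the standard library's urljoin follows; PAGE_URL and the helper name absolute_src are hypothetical, since the question does not give the real page address.

```python
from urllib.parse import urljoin

# Hypothetical page address (an assumption); replace it with the URL
# the HTML was actually fetched from.
PAGE_URL = "https://example.com/weather/index.html"


def absolute_src(page_url: str, src: str) -> str:
    """Resolve an img src (relative or already absolute) against the page URL."""
    return urljoin(page_url, src)


# The src value extracted by the code above
image_url = absolute_src(PAGE_URL, "/database/img/weather_today.jpg?ver=2018-08-01")
print(image_url)
# https://example.com/database/img/weather_today.jpg?ver=2018-08-01
```

The resulting absolute URL is what would be downloaded before handing the image to an OCR step.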

You can use CSS selectors, which are built into BeautifulSoup (the select() and select_one() methods):

data = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

from bs4 import BeautifulSoup
# 'lxml' requires the lxml package to be installed; 'html.parser' also works here
soup = BeautifulSoup(data, 'lxml')
print(soup.select_one('div#weather_today_content img')['src'])

Prints:

/database/img/weather_today.jpg?ver=2018-08-01

The selector div#weather_today_content img means: select the <div> with id=weather_today_content, then select the <img> inside that <div>.
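
Because the page layout changes from day to day, it may be worth guarding against the case where nothing matches: select_one() returns None when no element is found, and indexing a tag with ['src'] raises KeyError if the attribute is missing. The sketch below wraps the lookup in a helper; the name find_img_src is hypothetical, 'html.parser' is used only to avoid the lxml dependency, and the id is assumed to be a plain identifier that needs no CSS escaping.

```python
from typing import Optional

from bs4 import BeautifulSoup


def find_img_src(html: str, div_id: str) -> Optional[str]:
    """Return the src of the first img inside div#div_id, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Assumes div_id is a simple identifier (no characters needing CSS escaping)
    img = soup.select_one(f"div#{div_id} img")
    if img is None:
        return None
    # .get() returns None instead of raising KeyError when src is missing
    return img.get("src")


sample = """<div class="style3" id="weather_today_content">
<img alt="" src="/database/img/weather_today.jpg?ver=2018-08-01" style="width: 400px"/>
</div>"""

print(find_img_src(sample, "weather_today_content"))
print(find_img_src(sample, "no_such_div"))
```

The first call prints the src path and the second prints None, so the caller can skip the OCR step on days when the image is not on the page.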
