How to extract the title and src of images using BeautifulSoup or Selenium

I have the full page content:

content = driver.page_source
soup = BeautifulSoup(content, features="html.parser")

Then I did this:

idioma = soup.select(".idioma > span:nth-child(1)")

Which gave me this:

[<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>]

When I do this to get the titles:

idioma = [''.join(elem.find('img')['title']) for elem in idioma if elem]

I only get the first one:

['Idioma Aleman']

Why don't I get all of them?

Why you are not getting all the titles

This is because idioma contains only one element (the <span>), and find() returns only the first matching <img> inside it.
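A minimal sketch of that behavior, using a shortened, hypothetical copy of the markup (two images instead of ten, relative src values) so it runs without a browser. It contrasts find(), which stops at the first <img>, with find_all(), which returns every <img> in the span:

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened markup: one <span> wrapping two flag images.
html = """
<div class="idioma"><span>
<img class="post_flagen" src="ale.png" title="Idioma Aleman"/>
<img class="post_flagen" src="rus.png" title="Idioma Ruso"/>
</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() returns a list-like ResultSet; here it holds a single <span>.
spans = soup.select(".idioma > span:nth-child(1)")

# find() returns only the first <img> of each span.
first_only = [span.find("img")["title"] for span in spans]

# find_all() returns every <img> of each span.
all_titles = [img["title"] for span in spans for img in span.find_all("img")]

print(first_only)
print(all_titles)
```

Because the list comprehension in the question iterates over the spans rather than over the images, it can only ever produce one title per span.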

What you can do is something like this:

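# idioma is the list returned by select(), so loop over each <span> first,
# then over every <img> inside it
idioma = [img['title'] for span in idioma for img in span.find_all('img')]
print(idioma)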

Output:

['Idioma Aleman', 'Idioma Chino-tradicional', 'Idioma Coreano', 'Idioma Español', 'Idioma Español-latino', 'Idioma Frances', 'Idioma Ingles', 'Idioma Italiano', 'Idioma Portugues', 'Idioma Ruso']

Additional working example, based on the comments:

import bs4
content ='''<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>'''
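# Name the parser explicitly to avoid the "no parser was explicitly specified" warning
soup = bs4.BeautifulSoup(content, "html.parser")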

Here is the difference:

idiomaSpan = soup.select_one('span')
idioma = [elem['title'] for elem in idiomaSpan.find_all('img')]
print(idioma)
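Since the question also asks for src, here is a short sketch that collects both attributes in one pass. It assumes the same markup as above, trimmed to two of the images for brevity; the variable name flags is only illustrative. Using .get() instead of indexing returns None for an image that lacks an attribute rather than raising KeyError:

```python
from bs4 import BeautifulSoup

# Two of the <img> tags from the page, copied from the output above.
html = """<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>"""
soup = BeautifulSoup(html, "html.parser")

# Select every flag image inside a <span> and pair its title with its src.
flags = [(img.get("title"), img.get("src")) for img in soup.select("span img.post_flagen")]

for title, src in flags:
    print(title, src)
```

With the real page, the same comprehension should work on the soup built from driver.page_source, with the selector narrowed to ".idioma > span:nth-child(1) img" if other flag images appear elsewhere.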

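To extract the title and src attributes of all the <img> elements in the <span> using Selenium and Python, use WebDriverWait to wait for visibility_of_all_elements_located(). You can use either of the following locator strategies: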

Using CSS_SELECTOR for the title:

print([my_elem.get_attribute("title") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".idioma > span:nth-child(1) img.post_flagen[alt^='Idioma']")))])

Using XPATH for the src:

print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@class, 'idioma')]//span//img[starts-with(@alt, 'Idioma') and @class='post_flagen']")))])

Note: You have to add the following imports:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
