如何使用Beautifulsoup或Selenium提取图像的标题和src



所以我有所有的页面内容:

content = driver.page_source
soup = BeautifulSoup(content, features="html.parser")

然后,我做了这个:

idioma = soup.select(".idioma > span:nth-child(1)")

这给了我这个:

[<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>]

当我这样做以获得标题时:

idioma = [''.join(elem.find('img')['title']) for elem in idioma if elem]

我只得到了第一个。

['Idioma Aleman']

为什么我不让每个人都来?

为什么不获得所有标题

这是因为在idioma中只有一个元素,并且使用find()只得到第一个匹配。

你能做的是这样的事情:

idioma = [''.join(elem['title']) for elem in idioma.findAll('img')]
print (idioma)

输出

['Idioma Aleman', 'Idioma Chino-tradicional', 'Idioma Coreano', 'Idioma Español', 'Idioma Español-latino', 'Idioma Frances', 'Idioma Ingles', 'Idioma Italiano', 'Idioma Portugues', 'Idioma Ruso']

基于注释的附加工作示例

import bs4
content ='''<span>
<img alt="Idioma Aleman" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ale.png" title="Idioma Aleman"/>
<img alt="Idioma Chino-tradicional" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/chi.png" title="Idioma Chino-tradicional"/>
<img alt="Idioma Coreano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/cor.png" title="Idioma Coreano"/>
<img alt="Idioma Español" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español"/>
<img alt="Idioma Español-latino" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/esp.png" title="Idioma Español-latino"/>
<img alt="Idioma Frances" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/fra.png" title="Idioma Frances"/>
<img alt="Idioma Ingles" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ing.png" title="Idioma Ingles"/>
<img alt="Idioma Italiano" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/ita.png" title="Idioma Italiano"/>
<img alt="Idioma Portugues" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/por.png" title="Idioma Portugues"/>
<img alt="Idioma Ruso" class="post_flagen" src="https://www.gamestorrents.nu/wp-content/themes/GamesTorrent/css/images/flags/rus.png" title="Idioma Ruso"/>
</span>'''
soup = bs4.BeautifulSoup(content)

以下是区别:

idiomaSpan = soup.select_one('span')
idioma = [''.join(elem['title']) for elem in idiomaSpan.find_all('img')]
print (idioma)

要使用Selenium和python从所有<span>中提取titlesrc属性,必须诱导WebDriverWait等待visibility_of_all_elements_located(),并且可以使用以下定位器策略之一:

  • CSS_SELECTOR用于标题

    print([my_elem.get_attribute("title") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".idioma > span:nth-child(1) img.post_flagen[alt^='Idioma']")))])
    
  • XPATH用于src:

    print([my_elem.get_attribute("src") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//*[contains(@class, 'idioma')]//span//img[starts-with(@alt, 'Idioma') and @class='post_flagen']")))])
    
  • 注意:您必须添加以下导入:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    

最新更新