How do I get the text inside a 'ul' tag using Selenium?



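Please help me find a way to get the text inside the "ul" tag.

I want the information separated by commas, for example: "Contains Enzymatically Active B-Vitamins, Dietary Supplement, Non-GMO LE Certified"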

Website link: https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051


Here is the HTML:

<ul>
<li>Contains Enzymatically Active B-Vitamins
</li>
<li>Dietary Supplement
</li>
<li>Non-GMO LE Certified
</li>
</ul>

This should do it:

from selenium import webdriver
from selenium.webdriver.common.by import By

link = 'https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051'
with webdriver.Chrome() as driver:
    driver.get(link)
    # Selenium 4 removed find_elements_by_css_selector; find_elements(By.CSS_SELECTOR, ...) replaces it
    items = driver.find_elements(By.CSS_SELECTOR, "[itemprop='description'] > ul:nth-of-type(1) > li")
    elements = ', '.join(item.text for item in items)
    print(elements)

Output:

Contains Enzymatically Active B-Vitamins, Dietary Supplement, Non-GMO LE Certified 

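To extract the texts, e.g. Contains Enzymatically Active B-Vitamins, Dietary Supplement, using Selenium and Python, you can use either of the following locator strategies:

  • Using CSS_SELECTOR and printing a list:

    driver.get('https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051')
    # Selenium 4 removed find_elements_by_css_selector; needs: from selenium.webdriver.common.by import By
    print([my_elem.text for my_elem in driver.find_elements(By.CSS_SELECTOR, "div[itemprop='description']>ul li")])

  • Console Output: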

    ['Contains Enzymatically Active B-Vitamins', 'Dietary Supplement', 'Non-GMO LE Certified ', 'Promotes healthy metabolism of glucose, fat & alcohol', 'Supports the healthy energy production your body needs', 'Encourages healthy organ function, cognitive health & more', 'Helps inhibit potential vitamin B deficiency']
    
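  • Using XPATH and printing the elements as a comma-separated string:

    driver.get('https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051')
    # Selenium 4 removed find_elements_by_xpath; needs: from selenium.webdriver.common.by import By
    print(', '.join([my_elem.text for my_elem in driver.find_elements(By.XPATH, "//div[@itemprop='description']/ul//li")]))

  • Console Output: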

    Contains Enzymatically Active B-Vitamins, Dietary Supplement, Non-GMO LE Certified , Promotes healthy metabolism of glucose, fat & alcohol, Supports the healthy energy production your body needs, Encourages healthy organ function, cognitive health & more, Helps inhibit potential vitamin B deficiency
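
The two answers so far return different numbers of items because their selectors differ in scope: "> ul:nth-of-type(1) > li" keeps only the direct li children of the first ul, while ">ul li" (or "/ul//li") takes every li under any child ul. Here is a minimal offline sketch of that difference, using the standard library's xml.etree.ElementTree in place of a browser. The two-list markup is a hypothetical reconstruction inferred from the outputs above, not the page's actual source, and the helper name texts is made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: an assumed shape for the description block (two sibling lists).
SAMPLE = """
<div itemprop="description">
  <ul>
    <li>Contains Enzymatically Active B-Vitamins</li>
    <li>Dietary Supplement</li>
    <li>Non-GMO LE Certified</li>
  </ul>
  <ul>
    <li>Promotes healthy metabolism of glucose, fat &amp; alcohol</li>
    <li>Helps inhibit potential vitamin B deficiency</li>
  </ul>
</div>
""".strip()

root = ET.fromstring(SAMPLE)


def texts(nodes):
    """Return the stripped text of each element."""
    return [(node.text or "").strip() for node in nodes]


# Direct <li> children of the first <ul> only (narrow scope).
first_list_only = texts(root.findall("./ul[1]/li"))
# Every <li> below any child <ul> (wide scope).
all_lists = texts(root.findall("./ul//li"))

print(len(first_list_only), len(all_lists))
```

Note that some items in the wider result contain commas themselves, so a comma-joined string of all of them cannot be split back into items reliably; keeping the list form avoids that.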
    

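To extract the texts, e.g. Contains Enzymatically Active B-Vitamins, Dietary Supplement, ideally you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following locator strategies:

  • Using CSS_SELECTOR and printing a list: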

    driver.get('https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051')
    print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[itemprop='description']>ul li")))])
    
  • Console Output:

    ['Contains Enzymatically Active B-Vitamins', 'Dietary Supplement', 'Non-GMO LE Certified ', 'Promotes healthy metabolism of glucose, fat & alcohol', 'Supports the healthy energy production your body needs', 'Encourages healthy organ function, cognitive health & more', 'Helps inhibit potential vitamin B deficiency']
    
  • Using XPATH and printing the elements as a comma-separated string:

    driver.get('https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051')
    print(', '.join([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@itemprop='description']/ul//li")))]))
    
  • Console Output:

    Contains Enzymatically Active B-Vitamins, Dietary Supplement, Non-GMO LE Certified , Promotes healthy metabolism of glucose, fat & alcohol, Supports the healthy energy production your body needs, Encourages healthy organ function, cognitive health & more, Helps inhibit potential vitamin B deficiency
    
  • Note: You have to add the following imports:

    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
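
For readers unsure what the explicit wait above buys them: WebDriverWait re-checks the given condition at a fixed interval (0.5 seconds by default) until it returns something truthy or the timeout expires. The sketch below imitates that polling loop in plain Python so it runs without a browser. It is a simplified illustration, not Selenium's implementation; wait_until, items_when_ready and the fake "page" are hypothetical names, and it raises the built-in TimeoutError rather than Selenium's TimeoutException.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)


# Stand-in for a page whose list items only show up on the third check.
attempts = {"count": 0}


def items_when_ready():
    attempts["count"] += 1
    if attempts["count"] < 3:
        return []  # nothing visible yet, so the wait keeps polling
    return ["Dietary Supplement", "Non-GMO LE Certified"]


found = wait_until(items_when_ready, timeout=2.0, poll=0.01)
print(found)
```

Without the wait, a single immediate lookup would have seen the empty list from the first check, which is the failure mode the answer is guarding against on pages that render their content late.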
    

You can always get all the li elements, take the text from each of them, and use ", ".join(elements).


Code for your small example:

text = '''
<ul>
<li>Contains Enzymatically Active B-Vitamins
</li>
<li>Dietary Supplement
</li>
<li>Non-GMO LE Certified
</li>
</ul>'''
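import selenium.webdriver
from selenium.webdriver.common.by import By

driver = selenium.webdriver.Firefox()
driver.get("data:text/html;charset=utf-8," + text)
# Selenium 4 removed find_elements_by_tag_name; find_elements(By.TAG_NAME, ...) replaces it
elements = driver.find_elements(By.TAG_NAME, 'li')
elements = [i.text for i in elements]
print(", ".join(elements))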
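
Another option is to collect the text in a loop:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from shutil import which

chrome_path = which('chromedriver.exe')
# Selenium 4 takes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service=Service(executable_path=chrome_path))
# Load the page from the question before looking for elements
driver.get('https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051')
li_elements = driver.find_elements(By.TAG_NAME, 'li')
elements = []
for e in li_elements:
    # .text is an attribute of each element, not of the list find_elements returns
    elements.append(e.text)
print(", ".join(elements))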
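
The console outputs earlier in the thread show a stray space before one comma ("Non-GMO LE Certified , Promotes ..."), because one item's text ends in whitespace. The joining step can be isolated as a small pure function that trims each text and skips empty ones. This is only a sketch; join_item_texts is a hypothetical helper name, and the input list stands in for the .text values collected from the li elements.

```python
def join_item_texts(texts, sep=", "):
    """Trim each text, drop empty entries, and join the rest with sep."""
    cleaned = [t.strip() for t in texts]
    return sep.join(t for t in cleaned if t)


# Stand-in for [e.text for e in li_elements], including a trailing space and an empty item.
raw = [
    "Contains Enzymatically Active B-Vitamins",
    "Dietary Supplement",
    "Non-GMO LE Certified ",
    "",
]

result = join_item_texts(raw)
print(result)
```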

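Well, Selenium is meant for web automation, but data scraping (which is what you seem to be trying to do) is better handled with requests and Beautiful Soup. There are already posts about using Selenium, but these are easier to use, because you don't have to launch a web browser the way Selenium does.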

import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://ca.iherb.com/pr/Life-Extension-BioActive-Complete-B-Complex-60-Vegetarian-Capsules/67051")
soup = BeautifulSoup(r.content, 'html.parser')
list_items = soup.find('div', itemprop="description")
# \D+ captures the run of non-digit characters that follows the first <li>
found = str(re.findall(r'itemprop="description"><ul><li>(\D+)', str(list_items)))

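This takes only a second, whereas the other methods can take longer because they have to load a browser and navigate to the site to get this information. Once you have this and have used a regular expression to find the appropriate tags, you can use regular expressions again to clean it down to just the text.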

newfound = re.sub(r"</li>|[\[']", '', found)
newfound2 = re.sub(r"<li>", ', ', newfound)
stripped = newfound2.split('\xa0', 1)[0]

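The itemprop="description"><ul><li> line and the \xa0 line both come from viewing the page's source code and finding the list elements there. Here is some information on regular expressions: https://www.guru99.com/python-regular-expressions-complete-tutorial.html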
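
If running regular expressions over raw markup feels brittle, the same "no browser" idea can be sketched with an HTML parser instead. The example below uses only the standard library's html.parser on the snippet from the question, so it runs offline. It is not the approach of the answer above; the class name FirstListText is hypothetical, the second list in the sample is invented for illustration, and the parser deliberately ignores nested lists.

```python
from html.parser import HTMLParser


class FirstListText(HTMLParser):
    """Collect the text of each <li> inside the first <ul> of the input."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._ul_seen = 0          # how many <ul> start tags have been seen
        self._in_first_ul = False
        self._buffer = None        # None while outside an <li> of interest

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self._ul_seen += 1
            self._in_first_ul = self._ul_seen == 1
        elif tag == "li" and self._in_first_ul:
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "ul":
            self._in_first_ul = False
        elif tag == "li" and self._buffer is not None:
            self.items.append("".join(self._buffer).strip())
            self._buffer = None

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)


# The question's HTML, followed by an invented second list that should be skipped.
SAMPLE = """
<ul>
<li>Contains Enzymatically Active B-Vitamins
</li>
<li>Dietary Supplement
</li>
<li>Non-GMO LE Certified
</li>
</ul>
<ul>
<li>Helps inhibit potential vitamin B deficiency</li>
</ul>
"""

parser = FirstListText()
parser.feed(SAMPLE)
result = ", ".join(parser.items)
print(result)
```

With a fetched page, the text passed to feed() would be the response body rather than SAMPLE; whether the live page's first ul is the one wanted is an assumption that would need checking against its source.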
