Selenium/BeautifulSoup - Web scraping this field



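My code runs fine and prints the titles for all rows, except for the rows that have dropdowns.

For example, row 4 shows a dropdown when you click it. I implemented a "try" that, in theory, clicks the dropdown and then pulls the titles.

But when I execute click() and try to print, the rows that have these dropdowns do not print.

Expected output: print all titles, including the ones inside the dropdowns.

A user submitted an answer at the link StackOverFlowAnswer, but his answer is in a different format, and I don't know how to add the date, time, chairs, or the "On-demand" label shown at the top using his approach.

Any approach would be appreciated, ideally with the result in a dataframe. Thanks.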

from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')
new_titles = set()
productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    actions.move_to_element_with_offset(property,0,0).perform()
    time.sleep(4.5)
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    #print(sessiontitle)
    ifDropdown=property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                new_titles.add(title)

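The problem is with the driver.find_elements_by_class_name('item-expand-action expand') command. The find_elements_by_class_name('item-expand-action expand') locator is wrong. These web elements have multiple class names. To locate them, you can use a css_selector or XPath.
Also, since several elements have dropdowns, you should iterate over them to click each one. You cannot call .click() on a list of web elements.
So your code should look like this: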

ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)

As an alternative to the css_selector above, you can also use XPath:

ifDropdown=driver.find_elements_by_xpath('//a[@class="item-expand-action expand"]')
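Both locators express the same requirement: every name in the space-separated class attribute has to match. As a rough sketch of how the CSS form is derived, the helper below turns a class attribute value into a compound selector. The name `classes_to_css` is hypothetical and is not part of Selenium.

```python
def classes_to_css(class_attr, tag=""):
    """Build a compound CSS selector from a space-separated class attribute.

    Hypothetical helper for illustration only; it is not a Selenium function.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    # Each class name becomes ".name"; joining them means "has all of these"
    return tag + "".join("." + name for name in names)


print(classes_to_css("item-expand-action expand"))
print(classes_to_css("session-title card-title", tag="h4"))
```

Note that the CSS form matches regardless of class order or extra classes on the element, while the XPath test @class="item-expand-action expand" compares the whole attribute string exactly.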

UPD
If you want to print the new titles that were added, you can do it like this:

ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
newTitles=driver.find_elements_by_class_name('card-title')
for new_title in newTitles:
    print(new_title.text)

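Here, after expanding all the dropdown elements, I get all the new titles and then iterate over that list, printing each element's text.
driver.find_elements_by_class_name returns a list of web elements. You cannot apply .text to a list; you have to iterate over the list and read the text of each element in turn.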
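To make the point about lists concrete without launching a browser, here is a small sketch. `FakeElement` is a hypothetical stand-in for a Selenium web element (only its .text attribute is modelled), and the titles are invented sample data.

```python
class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement; only .text is modelled."""

    def __init__(self, text):
        self.text = text


# find_elements_* returns a plain Python list of elements
elements = [FakeElement("Opening session"), FakeElement("Poster session")]

# The list itself has no .text attribute, so this raises AttributeError
try:
    elements.text
except AttributeError as exc:
    print("AttributeError:", exc)

# Read .text from each element instead
texts = [element.text for element in elements]
print(texts)
```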
UPD2
The whole code to open the dropdowns and print their inner titles can look like this:
I did this with Selenium alone, without mixing it with bs4.

# Note: the find_element(s)_by_* helpers used below were removed in Selenium 4.3;
# on newer versions use driver.find_elements(By.XPATH, ...) and similar instead.
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')
new_titles = set()  # titles that have already been printed
productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    # scroll the row into view before reading or clicking it
    actions.move_to_element(property).perform()
    time.sleep(0.5)
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    ifDropdown=property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        # the row has a dropdown: open it and wait for its content to load
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                new_titles.add(title)

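Here I check whether the row has a dropdown. If it does, I open it and then get all the currently open titles. For each such title, I check whether it is new or was already opened before. If the title is new, that is, not yet in the set, I print it and add it to the set.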
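The set-based check described above can be isolated from the browser code. The following is only a sketch: `collect_new_titles` is a hypothetical helper name, and the titles are invented sample data rather than real output from the site.

```python
def collect_new_titles(visible_titles, seen):
    """Return the titles not seen before, in order, and record them in `seen`."""
    fresh = []
    for title in visible_titles:
        if title not in seen:
            seen.add(title)
            fresh.append(title)
    return fresh


seen = set()
# First pass: nothing has been expanded yet
print(collect_new_titles(["Session A", "Session B"], seen))
# Second pass: a dropdown was opened, so the old titles are found again
print(collect_new_titles(["Session A", "Session B", "Talk B1", "Talk B2"], seen))
```

Returning the new titles instead of printing them inside the loop would also make it easy to append them to a list for a dataframe later.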

To get all the data, including date, time, and chairs, you can use requests/BeautifulSoup alone. Selenium is not needed.

import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
url = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p={}"
for page in range(1, 5):  # <-- Increase number of pages here
    with requests.Session() as session:
        soup = BeautifulSoup(session.get(url.format(page)).content, "html.parser")
        for card in soup.select("div.card-block"):
            title = card.find(class_="session-title card-title").get_text()
            date = card.select_one(".internal_date div.property").get_text(strip=True)
            time = card.select_one(".internal_time div.property").get_text()
            try:
                chairs = card.select_one(".persons").get_text(strip=True)
            except AttributeError:
                # select_one returned None: this card lists no chairs
                chairs = "N/A"
            data.append({"title": title, "date": date, "time": time, "chairs": chairs})
df = pd.DataFrame(data)
print(df.to_string())

Output (truncated):

                                                                  title             date           time                                                                    chairs
0                                        Educational sessions on-demand  Thu, 16.09.2021  08:30 - 09:40                                                                       N/A
1                                            Special Symposia on-demand  Thu, 16.09.2021  12:30 - 13:40                                                                       N/A
2                                  Multidisciplinary sessions on-demand  Thu, 16.09.2021  16:30 - 17:40                                                                       N/A
3            MSD - Homologous Recombination Deficiency: BRCA and beyond  Fri, 17.09.2021  08:45 - 09:55                       Frederique Penault-Llorca(Clermont-Ferrand, France)
4  Servier - The clinical value of IDH inhibition in cholangiocarcinoma  Fri, 17.09.2021  08:45 - 10:15  Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)
5           AstraZeneca - Redefining Breast Cancer – Biology to Therapy  Fri, 17.09.2021  08:45 - 10:15                                Ian Krop(Boston, United States of America)
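In the output above, get_text(strip=True) runs several chairs together in one string. If separate names are wanted in the dataframe, one possible post-processing step is sketched below. `split_chairs` and `CHAIR_PATTERN` are hypothetical names, and the pattern assumes every entry has the form Name(Location) with no nested parentheses, which is only what the sample rows suggest.

```python
import re

# One "Name(Location)" entry; assumes neither part contains parentheses
CHAIR_PATTERN = re.compile(r"([^()]+)\(([^()]*)\)")


def split_chairs(chairs):
    """Split a run-together chairs string into (name, location) pairs."""
    if chairs == "N/A":
        return []
    return [
        (name.strip(), place.strip())
        for name, place in CHAIR_PATTERN.findall(chairs)
    ]


raw = "Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)"
for name, place in split_chairs(raw):
    print(name, "|", place)
```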
