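My code runs fine and prints the titles of all rows except the rows that have a dropdown.
For example, clicking row 4 opens a dropdown. I implemented a "try" that should, in theory, click the dropdown and then pull the titles.
However, when I call .click() and then try to print, the titles in the rows with these dropdowns are not printed.
Expected output: print all titles, including the ones inside the dropdowns.
A user posted an answer (StackOverflow answer link), but his answer uses a different format, and I don't know how to add the date, time, chairs, or the "On-demand" label shown at the top using his method.
Any approach would be appreciated, ideally one that puts the result into a DataFrame. Thanks.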
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)
page_source = driver.page_source
soup = BeautifulSoup(page_source,'html.parser')
new_titles = set()
productlist=driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    actions.move_to_element_with_offset(property,0,0).perform()
    time.sleep(4.5)
    sessiontitle=property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    #print(sessiontitle)
    ifDropdown=property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if(ifDropdown):
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            if(title not in new_titles):
                print(title)
                new_titles.add(title)
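The problem is the driver.find_elements_by_class_name('item-expand-action expand') command. The locator in find_elements_by_class_name('item-expand-action expand') is wrong: these web elements have multiple class names, and find_elements_by_class_name accepts only a single class name. To locate such elements, use a css_selector or XPath instead.
Also, since several elements have a dropdown, you have to iterate over them to click each one. You cannot call .click() on a list of web elements.
So your code should look something like this: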
ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
As an alternative to the css_selector above, you can also use XPath:
ifDropdown=driver.find_elements_by_xpath('//a[@class="item-expand-action expand"]')
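The relationship between the class attribute and the compound CSS selector can be shown with a small helper. This is only a sketch: classes_to_css is a hypothetical name that is not part of Selenium, and it assumes the class names contain no characters that would need CSS escaping.

```python
def classes_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute into a compound CSS selector.

    Hypothetical helper for illustration; it is not a Selenium function.
    Assumes plain class names that need no CSS escaping.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("expected at least one class name")
    # Each class gets its own leading dot, with no spaces in between,
    # so the selector matches elements carrying ALL of the classes.
    return "".join("." + name for name in names)


selector = classes_to_css("item-expand-action expand")
print(selector)
```

The resulting string is the same selector that is passed to find_elements_by_css_selector above. In Selenium 4 the equivalent call would be driver.find_elements(By.CSS_SELECTOR, selector), since the find_elements_by_* helpers were removed in later 4.x releases. Unlike the XPath with @class="...", a compound CSS selector still matches when the element carries extra classes or lists them in a different order.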
UPD
If you want to print the newly added titles, you can do it like this:
ifDropdown=driver.find_elements_by_css_selector('.item-expand-action.expand')
for drop_down in ifDropdown:
    drop_down.click()
    time.sleep(0.5)
newTitles=driver.find_elements_by_class_name('card-title')
for new_title in newTitles:
    print(new_title.text)
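Here, after expanding all the dropdown elements, I get all the new titles and then iterate over that list, printing the text of each element. driver.find_elements_by_class_name returns a list of web elements. You cannot apply .text to a list; you have to iterate over the list and get the text of each element in turn.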
UPD2
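The complete code that opens the dropdowns and prints the titles inside them could look like this.
I do this with Selenium alone rather than mixing it with bs4.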
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time

driver = webdriver.Chrome()
actions = ActionChains(driver)
driver.get('https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list')
time.sleep(4)  # crude wait for the page to finish loading

new_titles = set()  # titles that have already been printed
productlist = driver.find_elements_by_xpath("//div[@class='card item-container session']")
for property in productlist:
    # bring the session card into view before interacting with it
    actions.move_to_element(property).perform()
    time.sleep(0.5)
    sessiontitle = property.find_element_by_xpath(".//h4[@class='session-title card-title']").text
    print(sessiontitle)
    ifDropdown = property.find_elements_by_xpath(".//*[@class='item-expand-action expand']")
    if ifDropdown:
        # expand the dropdown and wait for its content to load
        ifDropdown[0].click()
        time.sleep(4)
        open_titles = driver.find_elements_by_class_name('card-title')
        for open_title in open_titles:
            title = open_title.text
            # print each title only the first time it is seen
            if title not in new_titles:
                print(title)
                new_titles.add(title)
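Here I check whether the row has a dropdown. If it does, I open it and then get all the titles that are currently open. For each such title, I check whether it is new or was already seen earlier. If the title is new, that is, not yet in the set, I print it and add it to the set.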
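The set-based check described above can be isolated from the browser and tried on plain strings. The following is a minimal sketch: take_new_titles is a hypothetical helper name, the sample titles are made up, and skipping empty strings is an added assumption (elements that are not visible may report empty text).

```python
def take_new_titles(titles, seen):
    """Return the titles not yet in `seen`, in order, and record them in `seen`.

    Hypothetical helper for illustration. Empty strings are skipped,
    which is an assumption that goes beyond the original loop.
    """
    fresh = []
    for title in titles:
        if title and title not in seen:
            seen.add(title)
            fresh.append(title)
    return fresh


seen = set()
# First pass: only the top-level session titles are on the page.
first = take_new_titles(["Session A", "Session B"], seen)
# Second pass: a dropdown was opened, so the same titles reappear
# together with the newly revealed ones.
second = take_new_titles(["Session A", "Session B", "Talk B1", "Talk B1"], seen)
print(first)
print(second)
```

In the Selenium loop, the list passed in would be the .text values of the elements returned after each click. Because the helper returns the new titles instead of printing them, they could also be appended to a list of rows and handed to pandas afterwards.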
To get all the data, including date, time and chairs, you can use just requests/BeautifulSoup. Selenium is not needed.
import requests
import pandas as pd
from bs4 import BeautifulSoup

data = []
url = "https://cslide.ctimeetingtech.com/esmo2021/attendee/confcal/session/list?p={}"

for page in range(1, 5):  # <-- Increase number of pages here
    with requests.Session() as session:
        soup = BeautifulSoup(session.get(url.format(page)).content, "html.parser")
        # each session is rendered as one card block
        for card in soup.select("div.card-block"):
            title = card.find(class_="session-title card-title").get_text()
            date = card.select_one(".internal_date div.property").get_text(strip=True)
            time = card.select_one(".internal_time div.property").get_text()
            try:
                chairs = card.select_one(".persons").get_text(strip=True)
            except AttributeError:
                # select_one returned None: this session lists no chairs
                chairs = "N/A"
            data.append({"title": title, "date": date, "time": time, "chairs": chairs})

df = pd.DataFrame(data)
print(df.to_string())
Output (truncated):
title date time chairs
0 Educational sessions on-demand Thu, 16.09.2021 08:30 - 09:40 N/A
1 Special Symposia on-demand Thu, 16.09.2021 12:30 - 13:40 N/A
2 Multidisciplinary sessions on-demand Thu, 16.09.2021 16:30 - 17:40 N/A
3 MSD - Homologous Recombination Deficiency: BRCA and beyond Fri, 17.09.2021 08:45 - 09:55 Frederique Penault-Llorca(Clermont-Ferrand, France)
4 Servier - The clinical value of IDH inhibition in cholangiocarcinoma Fri, 17.09.2021 08:45 - 10:15 Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)
5 AstraZeneca - Redefining Breast Cancer – Biology to Therapy Fri, 17.09.2021 08:45 - 10:15 Ian Krop(Boston, United States of America)
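In the output above, sessions with several chairs have all of them run together in one string, for example row 4. If separate names are needed, the string could be split afterwards. The sketch below assumes every chair is written as Name(Location) and that neither part contains nested parentheses; split_chairs and CHAIR_PATTERN are hypothetical names, not part of the original answer.

```python
import re

# One chair looks like "Name(City, Country)"; assumes no nested parentheses.
CHAIR_PATTERN = re.compile(r"([^()]+)\(([^()]*)\)")


def split_chairs(chairs):
    """Split a concatenated chairs string into (name, location) tuples.

    Hypothetical helper for illustration. Returns an empty list when
    nothing matches, for example for the "N/A" placeholder.
    """
    return [
        (name.strip(), place.strip())
        for name, place in CHAIR_PATTERN.findall(chairs)
    ]


row4 = "Arndt Vogel(Hannover, Germany)Angela Lamarca(Manchester, United Kingdom)"
for name, place in split_chairs(row4):
    print(name, "|", place)
```

Another option, not tested against this site, would be to pass a separator to get_text when reading the .persons element, so that the chairs are already delimited before they reach the DataFrame.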