Scrolling a web page using the Selenium WebDriver in Python



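I am currently using the Selenium WebDriver to parse this web page (https://startup-map.berlin/companies.startups/f/all_locations/allof_Berlin/data_type/allof_Verified) and extract all startup URLs with Python. I have tried all the relevant methods mentioned in this post: "How can I scroll a web page using selenium webdriver in python?", as well as other suggestions found online.

However, none of them worked for this website. It only loads the first 25 startups. A code sample: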

import time
from bs4 import BeautifulSoup
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
# Write into csv file
filename = "startups_urls.csv"
# BLD is not defined in this snippet; presumably a pathlib.Path from the asker's project
f = open(BLD / "processed/startups_urls.csv", "w")
headers = "startups_urls\n"
f.write(headers)
url = "https://startup-map.berlin/companies.startups/f/all_locations/allof_Berlin/data_type/allof_Verified"
driver.get(url)
time.sleep(3)
# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    # Scroll down to bottom
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    # Wait to load page
    time.sleep(3)
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height
htmlSource = driver.page_source
page_soup = BeautifulSoup(htmlSource, "html.parser")
startups = page_soup.findAll("div", {"class": "type-element type-element--h3 hbox entity-name__name entity-name__name--black"})
if startups != []:
    for startup in startups:
        startups_href = startup.a["href"]
        startups_url = "https://startup-map.berlin" + startups_href
        f.write(startups_url + "\n")
else:
    print("NaN.")

f.close()
driver.close()

Any suggestions? Thanks a lot.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://startup-map.berlin/companies.startups/f/all_locations/allof_Berlin/data_type/allof_Verified")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "#window-scrollbar .vertical-track")

# Send PAGE_DOWN to whichever element currently has focus
a = driver.switch_to.active_element
a.send_keys(Keys.PAGE_DOWN)

Just use Keys.PAGE_DOWN.
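
A single PAGE_DOWN only scrolls one screen, so in practice the key press would need to be repeated until nothing new loads. The following is a minimal sketch of such a loop, not a tested solution for this site. The helper names scroll_until_stable and load_all_entries are hypothetical, and the .entity-name__name selector is an assumption taken from the class string in the question. If the page recycles rows while scrolling, the element count may not grow, and a different measure (for example the scrollbar thumb position) would be needed.

```python
import time
from typing import Callable


def scroll_until_stable(
    scroll_once: Callable[[], None],
    measure: Callable[[], int],
    pause: float = 1.0,
    max_idle_rounds: int = 3,
    max_rounds: int = 500,
) -> int:
    """Call scroll_once() repeatedly until measure() stops changing.

    Stops after max_idle_rounds consecutive rounds without a change,
    or after max_rounds rounds in total. Returns the last measured value.
    """
    last = measure()
    idle = 0
    for _ in range(max_rounds):
        scroll_once()
        time.sleep(pause)  # give the page time to load more entries
        current = measure()
        if current == last:
            idle += 1
            if idle >= max_idle_rounds:
                break
        else:
            idle = 0
            last = current
    return last


def load_all_entries(url: str) -> int:
    """Hypothetical wiring of the helper to Selenium (needs a browser)."""
    # Imported here so scroll_until_stable stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.common.keys import Keys

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        time.sleep(3)

        def page_down() -> None:
            driver.switch_to.active_element.send_keys(Keys.PAGE_DOWN)

        def count_entries() -> int:
            # Assumption: every company name carries this class.
            return len(driver.find_elements(By.CSS_SELECTOR, ".entity-name__name"))

        return scroll_until_stable(page_down, count_entries)
    finally:
        driver.quit()
```

Calling load_all_entries with the URL from the question would open Chrome and return the number of entries found once the count stops growing. Requiring several idle rounds before stopping is a guard against a slow network response being mistaken for the end of the list.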

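You can get an indication of the scrolling progress from the position of the vertical-thumb element.
So you can read the translateY value from its style and compare it with the previous value, much as you currently compare new_height with last_height.
The CSS selector #window-scrollbar .vertical-thumb locates that element.
So you can do the following: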

# driver is the Chrome WebDriver instance; By comes from selenium.webdriver.common.by
element = driver.find_element(By.CSS_SELECTOR, "#window-scrollbar .vertical-thumb")
attributeValue = element.get_attribute("style")

The attributeValue string now contains something like the following:

position: relative; display: block; width: 100%; background-color: rgba(34, 34, 34, 0.6); border-radius: 4px; z-index: 1500; height: 30px; transform: translateY(847px);

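Now you can find the substring containing translateY and extract the number from it, like this:

index = attributeValue.find('translateY(')
sub_string = attributeValue[index:]
new_y_value = int(''.join(filter(str.isdigit, sub_string)))

In Python 3, filter returns an iterator, so the digits have to be joined back into a string before int() is called. If that does not work properly (although it should), try a regular expression instead:

new_y_value = int(re.findall(r'\d+', sub_string)[0])

To use re, you must first import it:

import re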
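
The two extraction steps above can also be folded into one small function. This is only a sketch: the name parse_translate_y is hypothetical, and it assumes the style attribute always expresses the offset as translateY(<number>px), as in the sample string shown earlier. It returns None when no translateY is present, which avoids an IndexError on an unexpected style.

```python
import re
from typing import Optional

# Matches translateY(847px), translateY(12.5px) or translateY(0)
_TRANSLATE_Y = re.compile(r"translateY\(\s*(-?\d+(?:\.\d+)?)(?:px)?\s*\)")


def parse_translate_y(style: str) -> Optional[float]:
    """Return the translateY offset found in an inline style string, or None."""
    match = _TRANSLATE_Y.search(style)
    return float(match.group(1)) if match else None


sample = (
    "position: relative; display: block; width: 100%; "
    "background-color: rgba(34, 34, 34, 0.6); border-radius: 4px; "
    "z-index: 1500; height: 30px; transform: translateY(847px);"
)
print(parse_translate_y(sample))  # 847.0
```

Comparing the value returned after each scroll with the previous one gives the same kind of stop condition as the new_height and last_height comparison in the question.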
