Can I pause the scrolling in Selenium, scrape the data that has loaded so far, and then resume scrolling later in the script?



I am a student working on a scraping project, and I am having trouble finishing my script because it fills up my computer's memory with all of the data it stores.

It currently keeps all of my data in memory until the very end, so my idea is to break the scrape into smaller pieces and write the data out periodically, instead of building one huge list and only writing it out at the end.

To do that, I need to stop the scrolling method, scrape the profiles that have loaded, write out the data I have collected, and then repeat the process without duplicating any data. I would really appreciate it if someone could show me how. Thanks for your help :(
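
Roughly, this toy example is the kind of batching I think I need (no Selenium here, the names and the stand-in list of links are just placeholders I made up); I just can't see how to fit it around the scrolling:

import csv

def write_batch(rows, path="Spredsheet.csv"):
    # Append one batch of rows to disk so nothing piles up in memory
    with open(path, "a", newline="") as f:
        csv.writer(f).writerows(rows)

all_links = ["profile_%d" % i for i in range(250)]   # stand-in for the loaded <a> elements
scraped_so_far = 0                                   # index of the last link already handled
batch_size = 100

while scraped_so_far < len(all_links):
    batch = []
    for link in all_links[scraped_so_far:scraped_so_far + batch_size]:
        batch.append([link])                         # real code would scrape the profile here
    scraped_so_far += len(batch)                     # remember where we stopped, so nothing is duplicated
    write_batch(batch)                               # flush this batch and drop it from memory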

Here is my current code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.common.exceptions import NoSuchElementException

Data = []
driver = webdriver.Chrome()
driver.get("https://directory.bcsp.org/")
count = int(input("Number of Pages to Scrape: "))
body = driver.find_element_by_xpath("//body") 
profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
while len(profile_count) < count:   # Get links up to "count"
    body.send_keys(Keys.END)
    sleep(1)
    profile_count = driver.find_elements_by_xpath("//div[@align='right']/a")
for link in profile_count:   # Calling up links
    temp = link.get_attribute('href')   # temp for
    driver.execute_script("window.open('');")   # open new tab
    driver.switch_to.window(driver.window_handles[1])   # focus new tab
    driver.get(temp)
    # scrape code
    Name = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[1]/div[2]/div').text
    IssuedBy = "Board of Certified Safety Professionals"
    CertificationorDesignaationNumber = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[1]/td[3]/div[2]').text
    CertfiedorDesignatedSince = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[3]/td[1]/div[2]').text
    try:
        AccreditedBy = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[3]/div[2]/a').text
    except NoSuchElementException:
        AccreditedBy = "N/A"
    try:
        Expires = driver.find_element_by_xpath('/html/body/table/tbody/tr/td/table/tbody/tr/td[5]/div/table[1]/tbody/tr/td[3]/table/tbody/tr[5]/td[1]/div[2]').text
    except NoSuchElementException:
        Expires = "N/A"
    info = Name, IssuedBy, CertificationorDesignaationNumber, CertfiedorDesignatedSince, AccreditedBy, Expires + "\n"
    Data.extend(info)
    driver.close()
    driver.switch_to.window(driver.window_handles[0])

with open("Spredsheet.txt", "w") as output:
    output.write(','.join(Data))
driver.close()

Try the following approach using Requests and BeautifulSoup. In the script below I have used the API URL extracted from the website itself, e.g.: API URL

  1. First, it will create the URL for the first iteration (refer to the first URL) and add the header and the data to the .csv file.
  2. For the second iteration it creates the URL again with 2 extra parameters (refer to the second URL): start_on_page=20&show_per_page=20, where the start_on_page value of 20 is incremented by 20 on each iteration and show_per_page=100 by default pulls 100 records per iteration, and so on until all the data has been dumped into the .csv file (second-iteration API URL; see the short sketch after this list).
  3. The script dumps 4 things: number, name, location and profile URL.
  4. On every iteration the data is appended to the .csv file, so your memory problem is solved by this approach.
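
As a quick check of the pagination described in point 2, here is a small standalone sketch (using the same parameter values as the script below) that only prints how those URLs grow over the first few iterations:

# Builds the first three search URLs; nothing is fetched here.
main_url = 'https://directory.bcsp.org/search_results.php?'
filters = ('first_name=&last_name=&city=&state=&country=&certification='
           '&unauthorized=0&retired=0&specialties=&industries=')

page_number = 0
page_size = 100
for iteration in range(3):
    if iteration == 0:
        url = main_url + filters                     # first iteration: no paging parameters
    else:
        url = (main_url + 'start_on_page=' + str(page_number)
               + '&show_per_page=' + str(page_size) + '&' + filters)
    print(iteration, url)
    page_number += 20                                # start_on_page advances by 20 each iteration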

Before running the script, don't forget to add your system path to the file_path variable, i.e. the directory where the .csv file should be created.

import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs
import csv

def scrap_directory_data():
    list_of_credentials = []
    file_path = ''
    file_name = 'credential_list.csv'
    count = 0
    page_number = 0
    page_size = 100
    create_url = ''
    main_url = 'https://directory.bcsp.org/search_results.php?'
    first_iteration_url = 'first_name=&last_name=&city=&state=&country=&certification=&unauthorized=0&retired=0&specialties=&industries='
    number_of_records = 0
    csv_headers = ['#','Name','Location','Profile URL']

    while True:
        if count == 0:
            create_url = main_url + first_iteration_url
            print('-' * 100)
            print('1 iteration URL created: ' + create_url)
            print('-' * 100)
        else:
            create_url = main_url + 'start_on_page=' + str(page_number) + '&show_per_page=' + str(page_size) + '&' + first_iteration_url
            print('-' * 100)
            print('Other than first iteration URL created: ' + create_url)
            print('-' * 100)
        page = requests.get(create_url, verify=False)
        extracted_text = bs(page.text, 'lxml')
        result = extracted_text.find_all('tr')
        if len(result) > 0:
            for idx, data in enumerate(result):
                if idx > 0:
                    number_of_records += 1
                    name = data.contents[1].text
                    location = data.contents[3].text
                    profile_url = data.contents[5].contents[0].attrs['href']
                    list_of_credentials.append({
                        '#': number_of_records,
                        'Name': name,
                        'Location': location,
                        'Profile URL': profile_url
                    })
                print(data)
                with open(file_path + file_name, 'a+') as cred_CSV:
                    csvwriter = csv.DictWriter(cred_CSV, delimiter=',', lineterminator='\n', fieldnames=csv_headers)
                    if idx == 0 and count == 0:
                        print('Writing CSV header now...')
                        csvwriter.writeheader()
                    else:
                        for item in list_of_credentials:
                            print('Writing data rows now..')
                            print(item)
                            csvwriter.writerow(item)
                        list_of_credentials = []
        else:
            break   # no more result rows returned, stop paging
        count += 1
        page_number += 20

scrap_directory_data()
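
If you still need the per-profile fields from your original Selenium script (name, certification number, and so on), one option is a second, equally batched pass over the CSV this script produces. The sketch below is only an outline under my assumptions (the function name, output file and batch size are made up, and the actual field parsing is left as a placeholder because the profile-page selectors are yours):

import csv
import requests
from urllib.parse import urljoin

def scrape_profiles_in_batches(csv_path='credential_list.csv', batch_size=50):
    out_rows = []
    with open(csv_path, newline='') as f:
        for row in csv.DictReader(f):
            # Profile URLs in the directory may be relative, so resolve them against the site root
            url = urljoin('https://directory.bcsp.org/', row['Profile URL'])
            page = requests.get(url, verify=False)
            # ... parse the certification details out of page.text with BeautifulSoup here ...
            out_rows.append([row['Name'], url])
            if len(out_rows) >= batch_size:           # flush periodically so memory stays flat
                with open('profile_details.csv', 'a', newline='') as out:
                    csv.writer(out).writerows(out_rows)
                out_rows = []
    if out_rows:                                      # flush the final partial batch
        with open('profile_details.csv', 'a', newline='') as out:
            csv.writer(out).writerows(out_rows)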
