How can I make my web-scraping script check two scenarios but execute only the one that is needed?



I am scraping some data from a website. Here is my script:

import warnings
warnings.filterwarnings("ignore")
import re
import time
import requests
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd
import numpy as np
import shutil
from selenium import webdriver
from selenium.webdriver.support.select import Select
from selenium.webdriver.support.ui import WebDriverWait     
from selenium.webdriver.common.by import By     
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

PATH = "driverchromedriver.exe"
options = webdriver.ChromeOptions() 

options.add_argument("--disable-gpu")
#options.add_argument('enable-logging')
options.add_argument("start-maximized")
#options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options, executable_path=PATH)
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
url = 'https://www.boursorama.com/'
driver.get(url)
cookie = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '//*[@id="didomi-notice-agree-button"]')))
try:
    cookie.click()
except:
    pass
df = pd.read_excel('liste.xlsx')
df2 = pd.DataFrame(df)
df3 = df2['Entreprises'].values.tolist()
currencies = []
for i in df3:
    try:
        # Scenario 1: the search returns a list of alternatives
        print(i)
        searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
        searchbar.click()
        searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
        searchbar2.click()
        searchbar2.send_keys(i + '\n')
        time.sleep(2)
        links = driver.find_elements_by_xpath('//*[@id="main-content"]/div/div/div[4]/div[1]/div[3]/div/div/div[2]/div[1]/div/div[3]/div/div[1]/div/table/tbody/tr[1]/td[1]/div/div[2]/a')
        for k in links:
            data = k.get_attribute("href")
        results = requests.get(data)
        soup = BeautifulSoup(results.text, "html.parser")
        currency = soup.find('span', class_= 'c-instrument c-instrument--last').text
        currencies.append(currency)
    except:
        # Scenario 2: the search lands directly on the company page
        print(i)
        searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
        searchbar.click()
        searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
        searchbar2.click()
        searchbar2.send_keys(i + '\n')
        time.sleep(2)
        url2 = driver.current_url
        results = requests.get(url2)
        soup = BeautifulSoup(results.text, "html.parser")
        currency = soup.find('span', class_= 'c-instrument c-instrument--last').text
        currencies.append(currency)
print(currencies)

liste.xlsx is just an Excel file that supplies the company names for my loop:

[image: liste]

This is my output:

TotalEnergies
TotalEnergies
Engie
Engie
BNP
BNP
['45.59', '11.07', '49.03']

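I don't understand: my script seems to run both the try and the except. I get 3 results, but each company is printed twice. My goal is: run the try if it applies, otherwise run the except.

Can I improve my code so that it runs only one of them, the one that is needed?

Sometimes a company search needs to be more specific and the website offers a list of alternatives, hence this code: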

try:
    print(i)
    searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
    searchbar.click()
    searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
    searchbar2.click()
    searchbar2.send_keys(i + '\n')
    time.sleep(2)
    links = driver.find_elements_by_xpath('//*[@id="main-content"]/div/div/div[4]/div[1]/div[3]/div/div/div[2]/div[1]/div/div[3]/div/div[1]/div/table/tbody/tr[1]/td[1]/div/div[2]/a')
    for k in links:
        data = k.get_attribute("href")

    results = requests.get(data)
    soup = BeautifulSoup(results.text, "html.parser")
    currency = soup.find('span', class_= 'c-instrument c-instrument--last').text
    currencies.append(currency)

Other times you type the exact name into the search bar and the website goes straight to the desired page, hence this code:

except:
    print(i)
    searchbar = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, 'html/body/div[6]/div[3]/div[2]/ol/li[1]/button')))
    searchbar.click()
    searchbar2 = WebDriverWait(driver, 50).until(EC.element_to_be_clickable((By.XPATH, '/html/body/div[6]/div[1]/div[2]/form/div/input')))
    searchbar2.click()
    searchbar2.send_keys(i + '\n')
    time.sleep(2)
    url2 = driver.current_url
    results = requests.get(url2)
    soup = BeautifulSoup(results.text, "html.parser")
    currency = soup.find('span', class_= 'c-instrument c-instrument--last').text
    currencies.append(currency)

So how can I make the script check both scenarios but execute only the one that is needed, and improve the running time?

"My goal is: run the try if it applies, otherwise run the except."

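That is exactly what it is doing. I suggest learning how to debug your code: step through it line by line, follow the logic, and you will see what happens.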

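With try/except, Python "tries" to execute the code in the try block. If that succeeds, the except block is skipped. If the try block fails at some point, the except block is executed instead.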

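It looks as if both blocks run because, technically, both do: the try block runs up to the point of failure, and then the except block runs. Because of where the print() statements are placed, you see each name printed twice.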

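Execution enters the try block and prints i with the print(i) at its start. At some point after that print(i), an error/exception is raised inside the try block, so execution moves to the except block, where the print(i) at the start of that block prints i again.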
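The flow described above can be reproduced without a browser. The following is a minimal sketch, not the original script: `lookup` and its `fail_in_try` flag are hypothetical names, and the simulated `ValueError` stands in for whatever exception the real try block raises.

```python
def lookup(name, fail_in_try):
    """Mimic the loop body of the question: each block starts by recording the name."""
    printed = []
    try:
        printed.append(name)  # stands in for print(i) at the top of the try block
        if fail_in_try:
            # Simulated failure somewhere after the first print
            raise ValueError("simulated failure inside the try block")
        printed.append("try finished")
    except ValueError:
        printed.append(name)  # stands in for print(i) at the top of the except block
        printed.append("except finished")
    return printed


print(lookup("Engie", fail_in_try=True))   # name recorded twice
print(lookup("Engie", fail_in_try=False))  # name recorded once
```

When the try block fails after its first statement, the name is recorded once by each block, which matches the doubled names in the question's output.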

If you want to test for a condition and run only the matching branch, use an if block to check the condition rather than try/except.
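As a rough sketch of that idea, the decision between the two scenarios can be reduced to one check on the list of result links. The helper name `choose_quote_url` and the example URLs are assumptions made for illustration; they do not come from the question.

```python
def choose_quote_url(result_links, current_url):
    """Pick the page to read the quote from.

    result_links: hrefs collected from the search-results table (may be empty).
    current_url:  the browser URL after the search was submitted.
    """
    if result_links:
        # Scenario 1: the site listed alternatives, so follow the first one
        return result_links[0]
    # Scenario 2: the site went straight to the company page
    return current_url


# Hypothetical URLs, for illustration only
print(choose_quote_url(["https://example.com/cours/AAA/"], "https://example.com/recherche/"))
print(choose_quote_url([], "https://example.com/cours/BBB/"))
```

In the Selenium script, the search would then be performed once per company, the hrefs gathered with something like `[a.get_attribute("href") for a in links]`, and the result passed in together with `driver.current_url`. This works because `find_elements_by_xpath` returns an empty list when nothing matches instead of raising an exception.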

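That said, fetching the data from the source is far more efficient than rendering the page with Selenium. You can also get more data that way. I don't know exactly which fields you want from the response, but the code below reads the instrument name and the price from it.

Code: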

import requests
from bs4 import BeautifulSoup

df3 = ['TotalEnergies', 'Engie', 'BNP']
currencies = []
for i in df3:
    # Query the search endpoint and take the first suggested instrument
    url = f'https://www.boursorama.com/recherche/ajax?query={i}&searchId='
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # The symbol is the second-to-last segment of the link's href
    symbol = soup.find('a', {'class': 'search__list-link'})['href'].split('/')[-2]

    # Request the quote data for that symbol as JSON
    url = 'https://www.boursorama.com/bourse/action/graph/ws/GetTicksEOD'
    payload = {
        'symbol': symbol,
        'length': '1',
        'period': '0',
        'guid': ''}

    jsonData = requests.get(url, params=payload).json()
    data = jsonData['d']

    name = data['Name']
    qd = data['qd']['c']

    currencies.append(qd)
    print(f'{name}: {qd}')
print(currencies)

Output:

TOTALENERGIES: 45.59
ENGIE: 11.07
BNP PARIBAS: 49.03
[45.59, 11.07, 49.03]
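One caveat with the one-line symbol lookup: if the search returns no matching link, `soup.find(...)` gives `None` and subscripting it raises a `TypeError`. The sketch below shows one possible guard. `symbol_from_href` is a hypothetical helper, and the example href is only illustrative of the trailing-slash form that the `split('/')[-2]` call assumes.

```python
def symbol_from_href(href):
    """Return the second-to-last path segment of href, or None if unavailable."""
    if not href:
        return None
    parts = href.split('/')
    if len(parts) < 2:
        return None
    # For '/cours/XYZ/' the split gives ['', 'cours', 'XYZ', ''], so [-2] is 'XYZ'
    return parts[-2] or None


# Illustrative href; assumes the link ends with a trailing slash
print(symbol_from_href('/cours/1rPTTE/'))
print(symbol_from_href(None))
```

In the loop, the link could be fetched first (`link = soup.find(...)`), the helper called with `link['href'] if link else None`, and the company skipped with `continue` when the result is `None`.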
