How to web scrape filtered results with Python (using Selenium)



I am trying to scrape the filtered results from this website: https://compranet.hacienda.gob.mx/esop/guest/go/public/opportunity/current?locale=es_MX

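First, I apply the filter "Código, descripción o referencia del Expediente". A new container then appears, where I choose "Contiene", and finally I search for a specific word (in this case "anestesia"). However, I don't know how to scrape the results table to get the links that appear in the "Descripción del Expediente" column for all of the filtered results. I'm new to Selenium, and I would like to get the filtered links, or to know whether there is another way to get the information I need.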

Here is my code:

import random
from time import sleep
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
from lxml import html

s=Service('./chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.get('https://compranet.hacienda.gob.mx/esop/guest/go/public/opportunity/current?locale=es_MX')
sleep(5)
driver.find_element(By.XPATH ,"//*[@id='widget_filterPickerSelect']/div[1]/input").click()
sleep(5)
driver.find_element(By.XPATH,"//*[@id='filterPickerSelect_popup1']").click()
sleep(5)
driver.find_element(By.XPATH,"//*[@id='projectInfo_FILTER_OPERATOR_ID']/option[2]").click()
sleep(5)
busqueda = driver.find_element(By.XPATH,"//*[@id='projectInfo_FILTER']")
busqueda.send_keys("anestesia")
busqueda.send_keys(Keys.ENTER)

Specifically, I want to scrape:

<a href="#fh" class="detailLink" onclick="javascript:goToDetail('2110224', '01000');stopEventPropagation(event);" title="Ver detalle: PC-050GYR017-E140-2022    SERVICIO INTEGRAL DE ANESTESIA, PARA EL EJERCICIO  DEL 1º">PC-050GYR017-E140-2022   SERVICIO INTEGRAL DE ANESTESIA, PARA EL EJERCICIO  DEL 1º</a>

I need to get the link.

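You need to use explicit waits.

To get the links on the final page, use find_elements or visibility_of_all_elements_located, because there are multiple web elements. If you only want to scrape the links, keep just the line print(link.get_attribute('href')) and comment out the other two print lines.

Code: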

s=Service('./chromedriver.exe')
driver = webdriver.Chrome(service=s)
driver.maximize_window()
wait = WebDriverWait(driver, 20)
driver.get('https://compranet.hacienda.gob.mx/esop/guest/go/public/opportunity/current?locale=es_MX')
wait.until(EC.element_to_be_clickable((By.XPATH, "//input[@value='▼ ']"))).click()
wait.until(EC.element_to_be_clickable((By.XPATH, "//div[@id='filterPickerSelect_popup1']"))).click()
select = Select(wait.until(EC.presence_of_element_located((By.ID, "projectInfo_FILTER_OPERATOR_ID"))))
select.select_by_value('CONTAINS')
busqueda = wait.until(EC.visibility_of_element_located((By.ID, "projectInfo_FILTER")))
busqueda.send_keys("anestesia")
time.sleep(2)
busqueda.send_keys(Keys.ENTER)
links = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//a[@class='detailLink'][@href]")))
# Print the visible text, href, and title of each result link
for link in links:
    print(link.get_attribute('innerText'))
    print(link.get_attribute('href'))
    print(link.get_attribute('title'))

Imports (in addition to those already in the question's code):

from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

Output:

PC-050GYR017-E140-2022 SERVICIO INTEGRAL DE ANESTESIA, PARA EL EJERCICIO DEL 1º
https://compranet.hacienda.gob.mx/esop/toolkit/opportunity/current/list.si?reset=true&resetstored=true&userAct=changeLangIndex&language=es_MX&_ncp=1649225706261.4394-1#fh
Ver detalle: PC-050GYR017-E140-2022 SERVICIO INTEGRAL DE ANESTESIA, PARA EL EJERCICIO  DEL 1º
SERVICIO DE MANTENIMIENTO PREVENTIVO Y CORRECTIVO DE EQUIPO MÉDICO
https://compranet.hacienda.gob.mx/esop/toolkit/opportunity/current/list.si?reset=true&resetstored=true&userAct=changeLangIndex&language=es_MX&_ncp=1649225706261.4394-1#fh
Ver detalle: SERVICIO DE MANTENIMIENTO PREVENTIVO Y CORRECTIVO DE EQUIPO MÉDICO
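
Note that every href in the output is just the current page URL followed by #fh, because the anchors use href="#fh" and navigate through their onclick handler instead. The values that identify each record are the arguments passed to goToDetail(...). A minimal sketch of pulling those out of the page source with the standard library might look like the following. The names parse_go_to_detail, DetailLinkParser, extract_detail_links and the detail_args key are hypothetical, and what the two arguments mean, or how the site turns them into a detail URL, is not stated here and is not assumed.

```python
import re
from html.parser import HTMLParser

# Matches goToDetail('2110224', '01000') and captures both quoted arguments.
GO_TO_DETAIL = re.compile(r"goToDetail\(\s*'([^']*)'\s*,\s*'([^']*)'\s*\)")


def parse_go_to_detail(onclick):
    """Return the two goToDetail arguments as a tuple, or None if absent."""
    match = GO_TO_DETAIL.search(onclick or "")
    return match.groups() if match else None


class DetailLinkParser(HTMLParser):
    """Collect the title and goToDetail arguments of each <a class="detailLink">."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "detailLink" in classes:
            self.links.append({
                "title": attrs.get("title"),
                # Hypothetical key name; holds the raw goToDetail arguments.
                "detail_args": parse_go_to_detail(attrs.get("onclick")),
            })


def extract_detail_links(page_source):
    """Parse an HTML string and return one dict per detailLink anchor."""
    parser = DetailLinkParser()
    parser.feed(page_source)
    return parser.links
```

With the Selenium session above, this could be called as extract_detail_links(driver.page_source) once the wait for the detailLink elements has returned. Alternatively, link.get_attribute('onclick') can be passed straight to parse_go_to_detail inside the existing loop.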
