Using Python and Selenium, how do I find hidden links to files on a website?



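With Python 3 and Selenium, I want to capture the PDF file links from a page. I could not find these links in Inspect Element; they seem to be generated dynamically.

So I looked for their exact location on the site: the "Documentos" link box, which holds a list of links (Certidão). When you click one of them, a new tab opens with the PDF (example).

I then wrote the script below. It locates the PDF link box by XPath and then calls a function to look up the exact attribute of the links.

But it doesn't work. Does anyone know how I can fix this, or is there another approach?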

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

site = "https://divulgacandcontas.tse.jus.br/divulga/#/candidato/2022/2040602022/AP/30001653385"

# Function to get the links with attribute
def find(elem):
    element = elem.get_attribute("dvg-link-doc dvg-certidao")
    if element:
        return element
    else:
        return False

driver = webdriver.Chrome(r'D:\Code\chromedriver.exe')
driver.get(site)

documents = []
# Look for the elements in the box where the PDFs are
elems = driver.find_elements("xpath", '/html/body/div[2]/div[1]/div/div[1]/section[3]/div/div[3]/div[2]/div/div/ul')

# Iterate over the elements found
for elem in elems:

    # Test if there is a link available
    try:
        links = WebDriverWait(elem, 2).until(find)
        print(links)

        if links.endswith(".pdf"):
            print(links)
            dicionario = {"link": links}
            documents.append(dicionario)

    except:
        continue

Here is one way to get the URLs of the PDF files under "Documentos" (the brown links):

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver/chromedriver")  ## path to where you saved chromedriver binary
browser = webdriver.Chrome(service=webdriver_service, options=chrome_options)
url = "https://divulgacandcontas.tse.jus.br/divulga/#/candidato/2022/2040602022/AP/30001653385"
counter = 0
browser.get(url)

# Wait until the document links (matched by their CSS classes) are present
links = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".dvg-link-doc.dvg-certidao")))
for x in range(len(links)):
    current_link = links[counter]
    print(current_link.text)
    t.sleep(1)
    # Clicking opens the PDF in a new tab
    current_link.click()
    t.sleep(1)
    # Switch to the newest tab and read the PDF URL from it
    browser.switch_to.window(browser.window_handles[-1])
    print(browser.current_url)
    t.sleep(1)
    # Reload the candidate page and locate the links again for the next iteration
    browser.get(url)
    counter = counter + 1
    links = WebDriverWait(browser, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".dvg-link-doc.dvg-certidao")))
    t.sleep(1)

This will print the following in the terminal:

Certidão criminal da Justiça Federal de 2º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/12_1659631723977.pdf
Certidão criminal da Justiça Federal de 1º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/11_1659631722277.pdf
Certidão criminal da Justiça Estadual de 2º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/14_1659631720538.pdf
Certidão criminal da Justiça Estadual de 1º grau
https://divulgacandcontas.tse.jus.br/candidaturas/oficial/2022/BR/AP/546/candidatos/897646/13_1659631719616.pdf

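You will need to adapt the code to your own Selenium setup: keep the imports, and use the code that comes after the browser/driver is defined. Selenium documentation: https://www.selenium.dev/documentation/

I would first find all the (relevant) links on the page. From there, I would read each one's href with element.get_attribute("href"), and if it ends with .pdf, I would assume it is a PDF.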
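A minimal sketch of that idea, assuming the anchors actually carry an `href` attribute (on the page in the question the links appear to be generated on click, so this may return nothing there). The helper names `is_pdf_url` and `collect_pdf_links` are made up for illustration; `driver` is any Selenium WebDriver that has already loaded the page.

```python
from urllib.parse import urlparse


def is_pdf_url(href):
    """Return True if the path part of the URL ends with .pdf (case-insensitive)."""
    if not href:
        return False
    # Look only at the path, so query strings and #fragments are ignored
    return urlparse(href).path.lower().endswith(".pdf")


def collect_pdf_links(driver):
    """Return the href of every <a href=...> on the current page that looks like a PDF.

    `driver` is assumed to be a Selenium 4 WebDriver. "css selector" is the
    string value of By.CSS_SELECTOR, so no extra import is needed here.
    """
    anchors = driver.find_elements("css selector", "a[href]")
    hrefs = (a.get_attribute("href") for a in anchors)
    return [h for h in hrefs if is_pdf_url(h)]
```

After `driver.get(site)` and a wait for the page to render, `collect_pdf_links(driver)` would give a list of URLs. Checking the parsed path rather than the raw string is a design choice so that a URL such as `file.pdf?v=1` still counts as a PDF.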
