Python web scraping - incomplete HTML returned



When I run my code, the HTML comes back with data missing.
Everything worked fine until I changed the code to use Selenium's expected conditions.

The code is not complete, because the full version was not accepted here, but I think you can see what is going on.

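from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# `options` is created in the part of the script that was omitted from the question.
navegador = webdriver.Firefox(options=options)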
wait = WebDriverWait(navegador, 30)
link = '******'
navegador.get(url = link)
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtLogin"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_txtSenha"))).send_keys('******')
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_btnEnviar"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_TreeView2t8"))).click()
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "a[title='07 de dezembro']"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.element_to_be_clickable((By.ID, "ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"))).click()
wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa"]/option[2]'))).click()
teste = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')
soup = BeautifulSoup(teste, "html.parser")

I get the following result:

<table align="center" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid" width="100%">
<tbody><tr>
<td>
<table>
<tbody><tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_Label1" style="font-size:12px;">Terminal - Empresa - Exportador:</span>
</td>
<td>
<select class="TextBox" id="ctl00_ctl00_Content_Content_ddlVagasTerminalEmpresa" name="ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa" onchange="javascript:setTimeout('__doPostBack('ctl00$ctl00$Content$Content$ddlVagasTerminalEmpresa','')', 0)" style="width: 475px;">
<option selected="selected" value="0">Selecione um Terminal.</option>
<option value="68623">TEAG - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP</option>
<option value="68594">TEG  - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP</option>
</select>
</td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_lbl_titulo_principal" style="font-size:12px;">Disponibilização de vagas do dia: 07/12/2022</span></td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td valign="top">
</td>
</tr>
<tr>

This is what I should be getting back:

</tr>
<tr>
<td></td>
</tr>
<tr>
<td valign="top">
<div id="ctl00_ctl00_Content_Content_pn_turno_1" style="width:100%;">

<table width="100%" style="border-right: #66cc00 1px solid; border-top: #66cc00 1px solid; border-left: #66cc00 1px solid; border-bottom: #66cc00 1px solid">
<tbody><tr>
<td class="Titulo">
<span id="ctl00_ctl00_Content_Content_lbl_turno_1">Turno 01 - intervalo: 7/12/2022 0:00:00 as 7/12/2022 1:00:00</span></td>
</tr>
<tr>
<td style="height:200px;width: 100%;" valign="top">
<table border="0" class="Grid" cellpadding="4" cellspacing="2" style="font-size:14;width: 100%;z-index: -1;">

</table>                                                                    
<table border="0" class="Grid" cellpadding="3" cellspacing="2" style="font-size:14;width: 100%">

<tbody><tr class="GridRow">                                
<td width="12%" align="center">
<span id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_lblEmpresaTerminal_1" title="TEAG - CARGILL - 04 CARGILL AGRICOLA S A  -  GUARUJA - SP" style="font-size:7px;">CARGILL - TEAG</span>
<input type="image" name="ctl00$ctl00$Content$Content$rpt_turno_1$ctl01$imb_vaga_1" id="ctl00_ctl00_Content_Content_rpt_turno_1_ctl01_imb_vaga_1" title="Vaga agendada." src="../App_Themes/SisLog/Images/caminhao.png" onclick="javascript:window.open('Cadastro.aspx?id_agenda=7054462&amp;id_turno=7/12/2022 0:00:00;7/12/2022 1:00:00&amp;data=07/12/2022&amp;id_turno_exportador=198574&amp;id_turno_agenda=61348&amp;id_transportadora=23213&amp;id_turno_transp=68623&amp;id_Cliente=7708&amp;codigo_terminal=7708&amp;codigo_empresa=1&amp;codigo_exportador=24978&amp;codigo_transportador=23213&amp;codigo_turno=1&amp;turno_transp_vg=68623','_blank','height=850,width=1000,top=(screen.width)?(screen.width-1000)/2 : 0,left=(screen.height)?(screen.height-700)/2 : 0,toolbar=no,location=no,directories=no,status=no,menubar=no,scrollbars=yes,resizable=no');" style="height:20px;border-width:0px;">                                                
</td>

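Since you did not share a link to the page you are working with, we can only guess what is causing your problem.
My guess is that you are reading the innerHTML of an element that has not finished rendering.
To fix this, try changing presence_of_element_located to visibility_of_element_located. presence_of_element_located only requires the element to exist in the DOM, while visibility_of_element_located also requires it to be displayed. So teste = wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML') becomes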

teste = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]'))).get_attribute('innerHTML')
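Visibility of the container does not guarantee that the page has already filled it after the dropdown selection. One possible alternative, sketched here under assumptions, is to wait on the content itself: the expected output in the question contains the id ctl00_ctl00_Content_Content_pn_turno_1, so a custom condition can keep polling until that marker shows up in the innerHTML. The helper name inner_html_contains is hypothetical and not part of Selenium; it relies only on the fact that WebDriverWait.until accepts any callable that takes the driver and returns a truthy value when ready.

```python
def inner_html_contains(locator, marker):
    """Build a wait condition that is truthy once the element's innerHTML contains `marker`.

    `locator` is a (by, value) tuple such as (By.XPATH, '//*[@id="divScroll"]').
    """
    def condition(driver):
        html = driver.find_element(*locator).get_attribute("innerHTML")
        # Return the HTML itself when the marker is present, False to keep polling.
        return html if html and marker in html else False

    return condition


# Possible usage with the objects from the question (assumes the marker id
# always appears once the data has loaded, which may not hold on every page):
# teste = wait.until(inner_html_contains(
#     (By.XPATH, '//*[@id="divScroll"]'),
#     "ctl00_ctl00_Content_Content_pn_turno_1",
# ))
```

Because the condition returns the HTML string, wait.until hands it back directly, so no second find_element call is needed.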

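If switching to visibility_of_element_located is not enough, try adding a short delay before extracting the HTML (this needs import time), like this: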

wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')

If the element is not visible, visibility_of_element_located cannot be applied to it. In that case, just use presence_of_element_located together with the delay:

wait.until(EC.presence_of_element_located((By.XPATH, '//*[@id="divScroll"]')))
time.sleep(2)
teste = navegador.find_element(By.XPATH, '//*[@id="divScroll"]').get_attribute('innerHTML')
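A fixed time.sleep(2) is a guess: it may be too short on a slow response and wasted time on a fast one. As a rough sketch of an alternative, the delay can be replaced by polling the innerHTML until two consecutive reads are identical. The helper wait_until_stable below is hypothetical (it is not a Selenium API) and takes any zero-argument callable, so it does not depend on the driver directly.

```python
import time


def wait_until_stable(read_value, timeout=30.0, interval=0.5):
    """Poll read_value() until two consecutive reads match, or raise TimeoutError."""
    deadline = time.monotonic() + timeout
    previous = read_value()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = read_value()
        if current == previous:
            # Nothing changed during one interval, so treat the value as settled.
            return current
        previous = current
    raise TimeoutError("value kept changing until the timeout expired")


# Possible usage with the objects from the question:
# teste = wait_until_stable(
#     lambda: navegador.find_element(By.XPATH, '//*[@id="divScroll"]')
#                      .get_attribute('innerHTML')
# )
```

Note that an unchanged value only shows the page has stopped updating, not that the data has arrived: if the update has not started yet, two reads of the still-empty container will also match. Checking the result for an expected marker before parsing it is a reasonable safeguard.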
