Python Scrapy does not extract a <script> tag



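I am new to Scrapy, and my code fails to extract the script tag I need from the HTML response.

The HTML response to every request begins with the structure below, as shown in the browser dev tools (this is one arbitrary response; the rest of the HTML is omitted because the relevant data is always in the same script tag):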

<html  xmlns="http://www.w3.org/1999/xhtml"><head><script src="/pje/a4j/g/3_3_3.Final/org/ajax4jsf/framework.pack.js" type="text/javascript"></script><script src="/pje/a4j/g/3_3_3.Final/org/richfaces/ui.pack.js" type="text/javascript"></script><link class="component" href="/pje/a4j/s/3_3_3.Finalorg/richfaces/renderkit/html/css/basic_classes.xcss/DATB/eAELXT5DOhSIAQ!sA18_" rel="stylesheet" type="text/css" /><link class="component" href="/pje/a4j/s/3_3_3.Finalorg/richfaces/renderkit/html/css/extended_classes.xcss/DATB/eAELXT5DOhSIAQ!sA18_" media="rich-extended-skinning" rel="stylesheet" type="text/css" /><link class="component" href="/pje/a4j/s/3_3_3.Final/org/richfaces/skin.xcss/DATB/eAELXT5DOhSIAQ!sA18_" rel="stylesheet" type="text/css" /><script id="org.ajax4jsf.queue_script" type="text/javascript">if (typeof A4J != 'undefined') { if (A4J.AJAX) { with (A4J.AJAX) {if (!EventQueue.getQueue('org.richfaces.queue.global')) { EventQueue.addQueue(new EventQueue('org.richfaces.queue.global',null,null)) };}}};</script><script type="text/javascript">window.RICH_FACES_EXTENDED_SKINNING_ON=true;</script><link class="user" href="/pje/stylesheet/estilos/bootstrap/bootstrap.min.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/dropzone/dropzone.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/estilos/richfaces/tema.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/padrao.css" rel="stylesheet" type="text/css" /><link class="user" href="/pje/stylesheet/autos-digitais.css" rel="stylesheet" type="text/css" /><script src="/pje/js/modernizr.custom.js" type="text/javascript"></script><script src="/pje/js/jquery-2.1.4.min.js" type="text/javascript"></script><script src="/pje/js/bootstrap/bootstrap.min.js" type="text/javascript"></script><script src="/pje/js/jquery.maskedinput.min.js" type="text/javascript"></script><script src="/pje/js/mousetrap/mousetrap.min.js" 
type="text/javascript"></script><script src="/pje/js/mousetrap/plugins/global-bind/mousetrap-global-bind.js" type="text/javascript"></script><script src="/pje/js/pje/menu.js" type="text/javascript"></script><script src="/pje/js/global.js" type="text/javascript"></script><script src="/pje/js/pje/autos-digitais.js" type="text/javascript"></script><link class="user" href="/pje/stylesheet/estilos/icomoon/style.css" rel="stylesheet" type="text/css" /><script src="/pje/js/jquery.maskMoney.js" type="text/javascript"></script><script src="/pje/js/pje.js" type="text/javascript"></script><script src="/pje/js/pjeOffice.js" type="text/javascript"></script><script src="/pje/js/signerApplet.js" type="text/javascript"></script></head><script>window.open('https://api-pjestorage.tjdft.jus.br/2021063010s/0709994-47.2021.8.07.0020-1625061173643-2414698-processo.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio-pje%2F20210630%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210630T135253Z&X-Amz-Expires=120&X-Amz-SignedHeaders=host&X-Amz-Signature=3348dc1ce55f1306d4555fb04f933af24ce5fa0b9c2540f5493a04bc83143be5');</script>
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:c="http://java.sun.com/jsp/jstl/core">
<head>


<title>0709994-47.2021.8.07.0020 &middot; Processo Judicial Eletr&ocirc;nico - 1&ordm; Grau</title>

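My goal is to extract the URL that is always contained in this part of the HTML response. In the example above it is: https://api-pjestorage.tjdft.jus.br/2021063010s/0709994-47.2021.8.07.0020-1625061173643-2414698-processo.pdf?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=minio-pje%2F20210630%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20210630T135253Z&X-Amz-Expires=120&X-Amz-SignedHeaders=host&X-Amz-Signature=3348dc1ce55f1306d4555fb04f933af24ce5fa0b9c2540f5493a04bc83143be5

As a test I used response.css("script").extract(), intending to pull the URL out of that script tag with the re module. For some reason Scrapy skips the tag I need and produces the following (I omitted the rest of the list, since I only care about the URL; Scrapy simply moves on to the next script tag and skips the one I need):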

'<script src="/pje/a4j/g/3_3_3.Final/org/ajax4jsf/framework.pack.js" type="text/javascript"></script>', '<script src="/pje/a4j/g/3_3_3.Final/org/richfaces/u
i.pack.js" type="text/javascript"></script>', '<script id="org.ajax4jsf.queue_script" type="text/javascript">if (typeof A4J != 'undefined') { if (A4J.AJAX) { with (
A4J.AJAX) {if (!EventQueue.getQueue('org.richfaces.queue.global')) { EventQueue.addQueue(new EventQueue('org.richfaces.queue.global',null,null)) };}}};</script>',
'<script type="text/javascript">window.RICH_FACES_EXTENDED_SKINNING_ON=true;</script>', '<script src="/pje/js/modernizr.custom.js" type="text/javascript"></script>',
'<script src="/pje/js/jquery-2.1.4.min.js" type="text/javascript"></script>', '<script src="/pje/js/bootstrap/bootstrap.min.js" type="text/javascript"></script>', '<
script src="/pje/js/jquery.maskedinput.min.js" type="text/javascript"></script>', '<script src="/pje/js/mousetrap/mousetrap.min.js" type="text/javascript"></script>',
'<script src="/pje/js/mousetrap/plugins/global-bind/mousetrap-global-bind.js" type="text/javascript"></script>', '<script src="/pje/js/pje/menu.js" type="text/javasc
ript"></script>', '<script src="/pje/js/global.js" type="text/javascript"></script>', '<script src="/pje/js/pje/autos-digitais.js" type="text/javascript"></script>',
'<script src="/pje/js/jquery.maskMoney.js" type="text/javascript"></script>', '<script src="/pje/js/pje.js" type="text/javascript"></script>', '<script src="/pje/js/p
jeOffice.js" type="text/javascript"></script>', '<script src="/pje/js/signerApplet.js" type="text/javascript"></script>', '<script type="text/javascript">\n\t//<![CDA
TA[\n\t(function($){\n\t\t  $(document).ready(function() {\n\t\t\tvar selector = 'dtInicioInputDate';\n\n\t\t\t//Seleciona elemento por id\n\t\t\tvar $input = $("in
put[id$='" + selector + "']");\n\t\t\t\n\t\t\tif($input.length < 1){\n\t\t\t\t//Seleciona elemento por class\n\t\t\t\t$input = $("input" + selector);\n\t\t\t}\n\t\t
\t\n\t\t\tif ('99/99/9999' == '') {\n\t\t\t\t$input.unmask();\n\t\t\t} else {\n\t\t\t\t$input.mask('99/99/9999');\n\t\t\t}\n\t\t });\n\t})(jQuery_21);\n\t//]]>
\n\t</script>'

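The last element in the list above appears after the desired script tag (I left it out of the HTML response shown earlier because it is not relevant to the task):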

'<script type="text/javascript">\n\t//<![CDATA[\n\t(function($){\n\t\t  $(document).ready(function() {\n\t\t\tvar selector = 'dtInicioInputDate';\n\n\t\t\t//Seleciona elemento por id\n\t\t\tvar $input = $("in
put[id$='" + selector + "']");\n\t\t\t\n\t\t\tif($input.length < 1){\n\t\t\t\t//Seleciona elemento por class\n\t\t\t\t$input = $("input" + selector);\n\t\t\t}\n\t\t
\t\n\t\t\tif ('99/99/9999' == '') {\n\t\t\t\t$input.unmask();\n\t\t\t} else {\n\t\t\t\t$input.mask('99/99/9999');\n\t\t\t}\n\t\t });\n\t})(jQuery_21);\n\t//]]>
\n\t</script>'

I have also tried using

pattern = re.compile(r"'(https://api-pjestorage.tjdft.jus.br/.+)'")
lst_find = re.findall(pattern=pattern, string=response.text)

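but it returns an empty list, even though the pattern works fine when I copy the HTML into a string and try it there. This suggests that the desired script tag is not contained in "response.text", for a reason I don't understand.

How can I get the complete raw HTML text to run the regex on, or how can I make sure response.css (or xpath) extracts the desired script tag?

Why does Scrapy skip this one script tag while extracting all the others correctly?

Unfortunately I cannot share the page I am trying to scrape, because it requires a login and password.

Any suggestions would be greatly appreciated. Sorry for my bad English.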

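The HTML page you are scraping appears to be malformed. For example, it contains two <html> elements and two <head> elements. Malformed HTML like this can prevent Scrapy from finding your script.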
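
One way to check whether a given response has this problem is to count the opening <html> and <head> tags in the raw text before handing it to a selector. This is only a minimal sketch using the standard library: the helper name count_opening_tags and the shortened sample string are hypothetical, and in a spider you would pass response.text instead.

```python
import re


def count_opening_tags(html_text, tag):
    """Count opening tags named `tag` (case-insensitive) in raw HTML text."""
    # The lookahead requires whitespace or '>' after the name, so counting
    # 'head' does not also count '<header>'. Closing tags start with '</'
    # and therefore never match.
    pattern = re.compile(r"<%s(?=[\s>])" % re.escape(tag), re.IGNORECASE)
    return len(pattern.findall(html_text))


# Shortened, made-up stand-in for the response shown in the question.
sample = (
    "<html  xmlns=\"http://www.w3.org/1999/xhtml\"><head></head>"
    "<script>window.open('https://example.invalid/file.pdf');</script>\n"
    "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n<head>\n"
    "<title>t</title></head></html>"
)

print(count_opening_tags(sample, "html"))  # 2
print(count_opening_tags(sample, "head"))  # 2
```

A count above 1 for either tag indicates that the document is not a single well-formed page, in which case working on the raw text may be more dependable than relying on the parsed tree.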

A simpler way to solve this is to rely purely on string manipulation and a regular expression.

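  1. Save only the first line of the HTML (everything before the first newline character \n) in a variable firstLine: firstLine = response.text.split('\n')[0]
  2. Apply the regular expression to that line:
    lst_find = re.findall(pattern=pattern, string=firstLine)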
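
The two steps above can be sketched as a small self-contained function. This is an illustration under assumptions, not the asker's actual spider: the function name extract_pdf_url and the sample text are made up, and the pattern is a slightly stricter variant of the one in the question (it anchors on window.open and uses [^']+ instead of a greedy .+ so the match stops at the closing quote).

```python
import re

# Assumed pattern: the single-quoted argument of window.open(...) on the
# api-pjestorage host seen in the question.
URL_PATTERN = re.compile(
    r"window\.open\('(https://api-pjestorage\.tjdft\.jus\.br/[^']+)'\)"
)


def extract_pdf_url(html_text):
    """Return the URL found on the first line of html_text, or None."""
    # Step 1: keep only the text before the first newline.
    first_line = html_text.split("\n", 1)[0]
    # Step 2: search that line with the regular expression.
    match = URL_PATTERN.search(first_line)
    return match.group(1) if match else None


# Made-up stand-in for response.text, with a shortened URL.
sample = (
    "<html><head></head><script>window.open('https://api-pjestorage.tjdft.jus.br/"
    "example/processo.pdf?X-Amz-Expires=120');</script>\n"
    "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
    "<html><head><title>t</title></head></html>"
)

print(extract_pdf_url(sample))
```

Inside a Scrapy callback this would be called as extract_pdf_url(response.text). If it returns None there, the script tag is probably absent from the body Scrapy received, which would point to a difference between the request Scrapy sends and the one the browser sends rather than to a parsing problem.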
    
