Scraping script that is supposed to grab PDF and DOC files is not working



I am trying to implement a similar script in my project; here is the blog post I am following: https://www.imagescape.com/blog/scraping-pdf-doc-and-docx-scrapy/

The spider class code, taken from the post:

import re
import textract
from itertools import chain
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from tempfile import NamedTemporaryFile

control_chars = ''.join(map(chr, chain(range(0, 9), range(11, 32), range(127, 160))))
CONTROL_CHAR_RE = re.compile('[%s]' % re.escape(control_chars))

TEXTRACT_EXTENSIONS = [".pdf", ".doc", ".docx", ""]


class CustomLinkExtractor(LinkExtractor):

    def __init__(self, *args, **kwargs):
        super(CustomLinkExtractor, self).__init__(*args, **kwargs)
        # Keep the default values in "deny_extensions" *except* for those types we want.
        self.deny_extensions = [ext for ext in self.deny_extensions if ext not in TEXTRACT_EXTENSIONS]


class ItsyBitsySpider(CrawlSpider):
    name = "itsy_bitsy"
    start_urls = [
        'https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html',
    ]

    def __init__(self, *args, **kwargs):
        self.rules = (Rule(CustomLinkExtractor(), follow=True, callback="parse_item"),)
        super(ItsyBitsySpider, self).__init__(*args, **kwargs)

    def parse_item(self, response):
        if hasattr(response, "text"):
            # The response is text - we assume html. Normally we'd do something
            # with this, but this demo is just about binary content, so...
            pass
        else:
            # We assume the response is binary data.
            # One-liner for testing if "response.url" ends with any of TEXTRACT_EXTENSIONS
            extension = list(filter(lambda x: response.url.lower().endswith(x), TEXTRACT_EXTENSIONS))[0]
            if extension:
                # This is a pdf or something else that Textract can process.
                # Create a temporary file with the correct extension.
                tempfile = NamedTemporaryFile(suffix=extension)
                tempfile.write(response.body)
                tempfile.flush()
                extracted_data = textract.process(tempfile.name)
                extracted_data = extracted_data.decode('utf-8')
                extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
                tempfile.close()

                with open("scraped_content.txt", "a") as f:
                    f.write(response.url.upper())
                    f.write("\n")
                    f.write(extracted_data)
                    f.write("\n\n")
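One detail in the spider worth calling out: the trailing empty string in TEXTRACT_EXTENSIONS is what keeps the [0] lookup from ever raising IndexError, because "" matches every URL; URLs without a textract extension therefore map to the falsy "". A standalone sketch of that one-liner (matched_extension is just an illustrative name, not part of the blog code):

```python
TEXTRACT_EXTENSIONS = [".pdf", ".doc", ".docx", ""]

def matched_extension(url: str) -> str:
    # "" always satisfies endswith(), so the filtered list is never empty
    # and [0] is safe; non-matching URLs yield the falsy "".
    return list(filter(lambda ext: url.lower().endswith(ext), TEXTRACT_EXTENSIONS))[0]
```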

My Python version is 3.10 and my OS is Windows 10. This is the error output when I run the crawler:

PS C:\Users\USER\Desktop\git repo\tut> scrapy crawl itsy_bitsy
2021-12-12 22:43:10 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: tut)
2021-12-12 22:43:10 [scrapy.utils.log] INFO: Versions: lxml 4.6.4.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.10.0 (tags/v3.10.0:b494f59, Oct  4 2021, 19:00:18) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l  24 Aug 2021), cryptography 35.0.0, Platform Windows-10-10.0.19042-SP0
2021-12-12 22:43:10 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-12-12 22:43:10 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'tut',
'NEWSPIDER_MODULE': 'tut.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['tut.spiders']}
2021-12-12 22:43:10 [scrapy.extensions.telnet] INFO: Telnet Password: ##
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-12-12 22:43:10 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-12-12 22:43:10 [scrapy.core.engine] INFO: Spider opened
2021-12-12 22:43:10 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-12-12 22:43:10 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-12-12 22:43:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/robots.txt> (referer: None)
2021-12-12 22:43:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html> (referer: None)
2021-12-12 22:43:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.docx> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
2021-12-12 22:43:13 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.docx> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
Traceback (most recent call last):
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\crawl.py", line 114, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "C:\Users\USER\Desktop\git repo\tut\tut\spiders\spider1.py", line 42, in parse_item
    extracted_data = textract.process(tempfile.name)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
    return parser.process(filename, input_encoding, output_encoding, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\docx_parser.py", line 11, in extract
    return docx2txt.process(filename)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\docx2txt\docx2txt.py", line 76, in process
    zipf = zipfile.ZipFile(docx)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\zipfile.py", line 1240, in __init__
    self.fp = io.open(file, filemode)
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\USER\\AppData\\Local\\Temp\\tmpvp9upczz.docx'
2021-12-12 22:43:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.pdf> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
2021-12-12 22:43:13 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.pdf> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
Traceback (most recent call last):
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 87, in run
    pipe = subprocess.Popen(
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 966, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1435, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\crawl.py", line 114, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "C:\Users\USER\Desktop\git repo\tut\tut\spiders\spider1.py", line 42, in parse_item
    extracted_data = textract.process(tempfile.name)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
    return parser.process(filename, input_encoding, output_encoding, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 29, in extract
    raise ex
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 21, in extract
    return self.extract_pdftotext(filename, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\pdf_parser.py", line 44, in extract_pdftotext
    stdout, _ = self.run(args)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 95, in run
    raise exceptions.ShellError(
textract.exceptions.ShellError: The command `pdftotext C:\Users\USER\AppData\Local\Temp\tmpg2cla7xb.pdf -` failed with exit code 127
------------- stdout -------------
------------- stderr -------------
2021-12-12 22:43:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.doc> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
2021-12-12 22:43:14 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.imagescape.com/media/uploads/zinnia/2018/08/20/sampletext.doc> (referer: https://www.imagescape.com/media/uploads/zinnia/2018/08/20/scrape_me.html)
Traceback (most recent call last):
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\scrapy\spiders\crawl.py", line 114, in _parse_response
    cb_res = callback(response, **cb_kwargs) or ()
  File "C:\Users\USER\Desktop\git repo\tut\tut\spiders\spider1.py", line 42, in parse_item
    extracted_data = textract.process(tempfile.name)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\__init__.py", line 79, in process
    return parser.process(filename, input_encoding, output_encoding, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\doc_parser.py", line 9, in extract
    stdout, stderr = self.run(['antiword', filename])
  File "C:\Users\USER\AppData\Local\Programs\Python\Python310\lib\site-packages\textract\parsers\utils.py", line 106, in run
    raise exceptions.ShellError(
textract.exceptions.ShellError: The command `antiword C:\Users\USER\AppData\Local\Temp\tmpndf_bon7.doc` failed with exit code 1
------------- stdout -------------
b''------------- stderr -------------
b'Traceback (most recent call last):\r\n  File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\runpy.py", line 196, in _run_module_as_main\r\n    return _run_code(code, main_globals, None,\r\n  File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\runpy.py", line 86, in _run_code\r\n    exec(code, run_globals)\r\n  File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\Scripts\\antiword.exe\\__main__.py", line 7, in <module>\r\n  File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\antiword.py", line 21, in main\r\n    r = run(cmd)\r\n  File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\subprocess.py", line 501, in run\r\n    with Popen(*popenargs, **kwargs) as process:\r\n  File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\subprocess.py", line 966, in __init__\r\n    self._execute_child(args, executable, preexec_fn, close_fds,\r\n  File "C:\\Users\\USER\\AppData\\Local\\Programs\\Python\\Python310\\lib\\subprocess.py", line 1435, in _execute_child\r\n    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,\r\nFileNotFoundError: [WinError 2] The system cannot find the file specified\r\n'
2021-12-12 22:43:14 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-12 22:43:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1649,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 46050,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'elapsed_time_seconds': 3.548882,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 12, 16, 43, 14, 330047),
'httpcompression/response_bytes': 230,
'httpcompression/response_count': 1,
'log_count/DEBUG': 5,
'log_count/ERROR': 3,
'log_count/INFO': 10,
'request_depth_max': 1,
'response_received_count': 5,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 4,
'scheduler/dequeued/memory': 4,
'scheduler/enqueued': 4,
'scheduler/enqueued/memory': 4,
'spider_exceptions/PermissionError': 1,
'spider_exceptions/ShellError': 2,
'start_time': datetime.datetime(2021, 12, 12, 16, 43, 10, 781165)}
2021-12-12 22:43:14 [scrapy.core.engine] INFO: Spider closed (finished)
PS C:\Users\USER\Desktop\git repo\tut>

I have installed all the pip packages mentioned in the blog post. I suspect the problem is some error in the antiword module, but that package also installed successfully via pip. Please help me troubleshoot this.

The program was written to run on Linux, so a few extra steps are needed to get it working on Windows.

1. Install the libraries.

Install with conda:

conda install -c conda-forge poppler
conda install -c conda-forge pdftotext

Install with pip:

pip install python-poppler
pip install pdftotext

2. Download antiword, extract the folder to C:\ (important!), and add it to your PATH.

3. The PermissionError happens because the script tries to reopen the temporary file while it is still open: on Windows, a NamedTemporaryFile cannot be opened a second time until it has been closed.

Change:

tempfile = NamedTemporaryFile(suffix=extension)
tempfile.write(response.body)
tempfile.flush()
extracted_data = textract.process(tempfile.name)
extracted_data = extracted_data.decode('utf-8')
extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
tempfile.close()

to:

tempfile = NamedTemporaryFile(suffix=extension, delete=False)
tempfile.write(response.body)
tempfile.close()
extracted_data = textract.process(tempfile.name)
extracted_data = extracted_data.decode('utf-8')
extracted_data = CONTROL_CHAR_RE.sub('', extracted_data)
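One caveat with delete=False: the file is no longer removed automatically, so each crawled document leaves a leftover in the temp directory unless you delete it yourself. A sketch of the same pattern with manual cleanup (extract_with_tempfile and the injectable process callable are illustrative names, not part of the blog code; in the spider, process would be textract.process):

```python
import os
from tempfile import NamedTemporaryFile

def extract_with_tempfile(body, extension, process):
    # Write the payload to a temp file, close it, hand the path to the
    # extraction callable, then remove the file ourselves: with
    # delete=False, nothing else will.
    tmp = NamedTemporaryFile(suffix=extension, delete=False)
    try:
        tmp.write(body)
        tmp.close()  # on Windows the file must be closed before another process opens it
        return process(tmp.name)
    finally:
        tmp.close()          # safe even if already closed
        os.remove(tmp.name)  # manual cleanup replaces the automatic delete
```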

4. Open a new terminal so the updated PATH environment variable is picked up.
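A quick way to confirm the new terminal actually sees the tools is to ask Python where it finds them (just a diagnostic sketch; shutil.which returns None when an executable is not on PATH):

```python
import shutil

# textract invokes both commands as subprocesses, so they must be
# discoverable on PATH; None here means the corresponding step above failed.
for tool in ("pdftotext", "antiword"):
    print(tool, "->", shutil.which(tool))
```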

5. Run scrapy crawl itsy_bitsy and enjoy.
