I am using news-please, cloned from https://github.com/fhamborg/news-please. I want to use news-please to fetch news articles from the Common Crawl news dataset. I am running the commoncrawl.py file as per the instructions there, using the command below:
python -m newsplease.examples.commoncrawl
On executing the above command, I get the following error:
my_local_download_dir_warc=./cc_download_warc/
my_local_download_dir_article=./cc_download_articles/
delete_warc_after_extraction=False
my_number_of_extraction_processes=1
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print $4 }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:found local file ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F, not downloading again due to configuration
Traceback (most recent call last):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 236, in _detect_type_load_headers
    rec_headers = self.arc_parser.parse(stream, statusline)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 312, in parse
    raise StatusAndHeadersParserException(msg, parts)
warcio.statusandheaders.StatusAndHeadersParserException: Wrong # of headers, expected arc headers ['uri', 'ip-address', 'archive-date', 'content-type', 'length'], Found ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 172, in <module>
    main()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 168, in main
    continue_process=True)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 320, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 230, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
    self.__run()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 231, in __process_warc_gz_file
    for record in ArchiveIterator(stream):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 243, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Unknown archive format, first line: ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']
What is the error here, and how do I fix it?
The README at https://github.com/fhamborg/news-please says to adapt the configuration section in newsplease/examples/commoncrawl.py. What does that mean? I have copied the configuration from that file and pasted it into config.cfg, which sits in the newsplease/config directory. Is that what is intended, or have I made a mistake here?
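For reference, this is roughly the block I believe the README refers to, sketched from the setting names echoed in my output above (the exact variable names and defaults in the current file may differ, so treat this as an illustration rather than the file's actual contents):

# module-level settings near the top of newsplease/examples/commoncrawl.py (illustrative)
my_local_download_dir_warc = './cc_download_warc/'          # where the .warc.gz files are stored
my_local_download_dir_article = './cc_download_articles/'   # where extracted articles are written
my_delete_warc_after_extraction = False                     # "my_" prefix assumed; the log prints "delete_warc_after_extraction"
my_number_of_extraction_processes = 1                       # size of the extraction process pool

Does "adapt the configuration section" simply mean editing these values in place and then running python -m newsplease.examples.commoncrawl again?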
I am using Python 3.6, and it is the only Python installed on my machine.
This error is caused by the libraries that news-please uses. It appears when we install each library manually, and the catch in the installation is the package versions: the version of every library is given in the setup.py file, and you should install exactly those versions. Now there may be a problem in executing setup.py, so use this command:
python3 setup.py install
If you need to uninstall all previously installed versions of the packages, run:
pip3 freeze --user | xargs pip3 uninstall -y
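After reinstalling, it is worth checking which versions actually ended up installed and comparing them with the versions pinned in setup.py. A minimal check, assuming warcio (the package from the traceback above) and news-please itself are the ones you care about:

# print the installed versions of a few packages so they can be compared
# against the versions listed in setup.py (extend the tuple as needed)
import pkg_resources

for pkg in ("news-please", "warcio"):
    try:
        print(pkg, pkg_resources.get_distribution(pkg).version)
    except pkg_resources.DistributionNotFound:
        print(pkg, "is not installed")

If any version differs from what setup.py specifies, uninstall that package and run python3 setup.py install again.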