Crawling and scraping Wikipedia: Picture of the day

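I'm working on a pet project that requires me to crawl a set of Wikipedia pages: the Picture of the day archives, organized by month. For example, https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/May_2004 has a list of images, each followed by a short caption. I want to do the following two things: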

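1. Scrape all the images and their corresponding captions from the page (preferably into a dictionary of image: caption pairs).
2. Crawl through the other months and repeat step 1.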

Any help on how to accomplish this would be greatly appreciated.

Thanks.

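I'd suggest using Scrapy in Python, since it is much lighter than, for example, Selenium. In the parse function you can look for all img tags in the HTML of the given page, as done here. You can then print all the image links and texts that were found (all the text we need is in <p> tags), or save them to a file if needed.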

import logging

import scrapy
from scrapy.crawler import CrawlerProcess


class Spider(scrapy.Spider):
    # Scrapy reads name and start_urls as class attributes.
    name = "WikiScraper"
    # Here you can add more links or generate them.
    start_urls = ["https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/May_2004"]

    def parse(self, response):
        # Print the src attribute of every <img> tag on the page.
        for src in response.css("img::attr(src)").getall():
            print("Image:", src)
        # Print every text node inside a <p> tag, one fragment at a time.
        # ('p *::text' would skip text sitting directly in the <p> itself.)
        for text in response.xpath("//p//text()").getall():
            print("Text:", text)


if __name__ == "__main__":
    # Keep Scrapy's own log messages out of the printed output.
    logging.getLogger("scrapy").propagate = False
    process = CrawlerProcess()
    process.crawl(Spider)
    process.start()
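
The comment on start_urls mentions generating the links. Here is a minimal sketch of that, assuming every monthly page follows the same Month_Year pattern as the May_2004 URL above. The names BASE_URL, MONTH_NAMES and month_urls are made up for illustration, and I have not checked that a page exists for every month in a given range.

```python
# Hypothetical helper: build monthly archive URLs from the pattern seen in
# the May_2004 example. Whether each page actually exists is an assumption.
BASE_URL = "https://en.wikipedia.org/wiki/Wikipedia:Picture_of_the_day/"

MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)


def month_urls(start_year, start_month, end_year, end_month):
    """Return the URLs from the start month to the end month, inclusive."""
    urls = []
    year, month = start_year, start_month
    # Tuples compare element by element, so this walks month by month.
    while (year, month) <= (end_year, end_month):
        urls.append(f"{BASE_URL}{MONTH_NAMES[month - 1]}_{year}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls


if __name__ == "__main__":
    for url in month_urls(2004, 5, 2004, 8):
        print(url)
```

In the spider you could then write something like start_urls = month_urls(2004, 5, 2004, 12) instead of listing the links by hand.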

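Finally, you have to join all the text fragments that belong together (I didn't have time to do that) and add all the pages you need. Everything else I didn't mention can be found in the Scrapy documentation.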
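
To give an idea of the joining step, here is a rough sketch that pairs each image with the caption paragraph that follows it and returns an image: caption dictionary. It uses the standard library's html.parser rather than Scrapy so that it runs on its own. The sample HTML is invented, and the assumption that each image is followed by one <p> caption comes only from the description in the question, so the real Wikipedia markup may need different handling. CaptionPairer and pair_images_with_captions are hypothetical names.

```python
from html.parser import HTMLParser


class CaptionPairer(HTMLParser):
    """Pair each <img> src with the text of the next <p> after it."""

    def __init__(self):
        super().__init__()
        self.pairs = {}            # image URL -> joined caption text
        self._pending_src = None   # image still waiting for a caption
        self._in_p = False
        self._chunks = []          # text fragments of the current <p>

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self._pending_src = dict(attrs).get("src")
        elif tag == "p":
            self._in_p = True
            self._chunks = []

    def handle_data(self, data):
        # Collect text from the <p> and from any tags nested inside it.
        if self._in_p:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            # Join the fragments and collapse runs of whitespace.
            caption = " ".join("".join(self._chunks).split())
            if self._pending_src and caption:
                self.pairs[self._pending_src] = caption
                self._pending_src = None


def pair_images_with_captions(html):
    parser = CaptionPairer()
    parser.feed(html)
    return parser.pairs


# Invented sample markup, only for demonstration.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/wiki/File:A.jpg"><img src="//upload.example.org/a.jpg"></a></td>
      <td><p>A <a href="/wiki/Heron">heron</a> in <b>flight</b>.</p></td></tr>
  <tr><td><img src="//upload.example.org/b.jpg"></td>
      <td><p>The <i>Moon</i> at dusk.</p></td></tr>
</table>
"""

if __name__ == "__main__":
    for src, caption in pair_images_with_captions(SAMPLE_HTML).items():
        print(src, "->", caption)
```

Inside a Scrapy parse method the same join can be done per paragraph, for example "".join(p.xpath(".//text()").getall()) for each p in response.css("p").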

Hope I didn't miss anything!
