I am trying to scrape images from Google, but I keep getting duplicate images. It downloads about 200 images, but only around 60 of them are unique. How can I get more unique images and eliminate the duplicates?
Here is my code:
import json
import os
import time
import requests
from PIL import Image
from StringIO import StringIO
from requests.exceptions import ConnectionError
import string
import urllib
import random
def go(query, path):
    BASE_PATH = os.path.join(path, query)
    if not os.path.exists(BASE_PATH):
        os.makedirs(BASE_PATH)

    resultitem = 0
    file_save_dir = BASE_PATH
    filename_length = 10
    filename_charset = string.ascii_letters + string.digits
    ipaddress = '163.118.75.137'
    url = ('https://ajax.googleapis.com/ajax/services/search/images?'
           'v=1.0&q=' + query + '&start=%d')
    while(resultitem < 60):
        response = requests.get(url % resultitem)
        results = json.loads(response.text)
        for result in results['responseData']['results']:
            print result['unescapedUrl']
            filename = ''.join(random.choice(filename_charset)
                               for s in range(filename_length))
            urllib.urlretrieve(result['unescapedUrl'],
                               os.path.join(file_save_dir, filename + '.png'))
        resultitem = resultitem + 1  # or + 8 Duplicates?

def main():
    go('angry human face', 'myDirectory')

if __name__ == "__main__":
    main()
The problem is here:
filename = ''.join(random.choice(filename_charset)
                   for s in range(filename_length))
It is not unique, and you are overwriting files.
You should use the tempfile module instead.
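A minimal sketch of that route, reusing only the file_save_dir and result names from the code above (those are the only assumptions): tempfile.mkstemp creates a uniquely named file for you, so two downloads can never end up at the same path.

import os
import tempfile
import urllib

# Sketch only: 'file_save_dir' and 'result' are the variables from the code
# in the question. mkstemp creates a uniquely named empty file and returns
# an open descriptor together with the file's full path.
fd, filepath = tempfile.mkstemp(suffix='.png', dir=file_save_dir)
os.close(fd)  # close our handle; urlretrieve reopens the path for writing
urllib.urlretrieve(result['unescapedUrl'], filepath)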
Or, since what you really care about is a unique filename, you could do it like this:
for idx, result in enumerate(results['responseData']['results']):
    print result['unescapedUrl']
    filename = "IMG%s" % idx
idx here will be a unique number for each URL.
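One follow-up caveat, sketched under the same assumptions as the code in the question (its url, resultitem, results and file_save_dir variables): idx restarts at 0 on every response page, so if you fetch several pages a separate running counter (a hypothetical variable, here called saved) keeps the names unique across requests as well, and advancing resultitem by the number of results actually returned avoids re-fetching overlapping pages, in the spirit of the "+ 8" comment in your code.

saved = 0  # hypothetical running counter; lives outside the request loop
while resultitem < 60:
    response = requests.get(url % resultitem)
    results = json.loads(response.text)
    items = results['responseData']['results']
    for result in items:
        print result['unescapedUrl']
        filename = "IMG%d" % saved  # unique across all pages, not just this one
        saved += 1
        urllib.urlretrieve(result['unescapedUrl'],
                           os.path.join(file_save_dir, filename + '.png'))
    # Advance by however many results came back, instead of by 1, so the
    # next request starts at a fresh page rather than re-downloading
    # most of the previous one.
    resultitem += len(items)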