SageMaker: S3 folder is not fully iterated



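I am using a for loop to iterate over an S3 bucket folder and pick out the JSON files in it. The folder contains about 30,000 files, of which about 15,000 are JSON files. When I iterate over the folder, I only get about 300 of them. No error is raised!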

import boto3

# Credentials are redacted placeholders.
client = boto3.client(
    's3',
    aws_access_key_id="xxxxxx",
    aws_secret_access_key="xxxxxxx",
)
# List the objects under the prefix with a single request.
data = client.list_objects_v2(
    Bucket='rawdata',
    Prefix='mixedfiles',
)
# Keep only the keys that end in .json.
json_files = [content["Key"] for content in data["Contents"] if content["Key"].endswith(".json")]
for json_file in json_files:
    print(json_file)

Update:

import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(
    Bucket='rawdata',
    Prefix='mixedfiles'
)
for page in response_iterator:
    # 'Contents' is missing from a page that has no matching objects.
    for content in page.get('Contents', []):
        if content['Key'].endswith('.json'):
            result_json = content['Key']
            print(result_json)

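You need to use a paginator. A single list_objects_v2 call returns at most 1,000 keys, so the rest of the folder is never listed.

See https://adamj.eu/tech/2018/01/09/using-boto3-think-pagination/ for more.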

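import boto3

s3_client = boto3.client('s3')
paginator = s3_client.get_paginator('list_objects')
pages = paginator.paginate(Bucket='my-bucket')
for page in pages:
    # 'Contents' is missing from a page that has no objects.
    for obj in page.get('Contents', []):
        do_something(obj)  # placeholder for your own processing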
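
For anyone curious what the paginator does for you, here is a rough sketch of the same loop written by hand with continuation tokens. The helper name iter_json_keys is made up for illustration and is not part of boto3. It only assumes that the client you pass in has a list_objects_v2 method that accepts Bucket, Prefix and ContinuationToken and returns the usual Contents, IsTruncated and NextContinuationToken fields.

```python
def iter_json_keys(client, bucket, prefix=""):
    """Yield every key under `prefix` that ends with .json.

    Hypothetical helper: `client` is expected to be a boto3 S3 client
    (or anything with a compatible list_objects_v2 method).
    """
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        response = client.list_objects_v2(**kwargs)
        # "Contents" is absent when the page holds no matching objects.
        for obj in response.get("Contents", []):
            if obj["Key"].endswith(".json"):
                yield obj["Key"]
        # IsTruncated tells us whether more pages remain.
        if not response.get("IsTruncated"):
            break
        # Ask for the next page on the following request.
        kwargs["ContinuationToken"] = response["NextContinuationToken"]


# Possible usage with a real client (requires boto3 and AWS credentials):
#   import boto3
#   for key in iter_json_keys(boto3.client("s3"), "rawdata", "mixedfiles"):
#       print(key)
```

In practice the paginator is probably the better choice since it handles the token bookkeeping for you. The manual loop is mostly useful for seeing why the single call in the question stopped early.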