Is it possible to loop through an Amazon S3 bucket with Python and count the number of lines in its files/keys?

Is it possible to use Python to loop through the files/keys in an Amazon S3 bucket, read their contents, and count the number of lines?

For example:

  1. My bucket: "my-bucket-name"
  2. File/Key : "test.txt"

I need to loop through the file "test.txt" and count the number of lines in the raw file.

Sample code:

import boto
conn = boto.connect_s3()  # boto 2 connection; assumes credentials are configured
for bucket in conn.get_all_buckets():
    if bucket.name == "my-bucket-name":
        for file in bucket.list():
            # TODO: count the number of lines in each file and print to a log
            pass

With boto3 you can do the following:

import boto3
# create the s3 resource
s3 = boto3.resource('s3')
# get the file object
obj = s3.Object('bucket_name', 'key')
# read the file contents into memory (returned as bytes)
file_contents = obj.get()["Body"].read()
# count the newline bytes to get the number of lines
print(file_contents.count(b'\n'))

If you want to do this for every object in a bucket, you can use the following snippet:

bucket = s3.Bucket('bucket_name')
for obj in bucket.objects.all():
    file_contents = obj.get()["Body"].read()
    print(file_contents.count(b'\n'))

Here is a reference to the boto3 documentation for more functionality: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#object

Update: (using boto 2)

import boto
s3 = boto.connect_s3()  # establish connection
bucket = s3.get_bucket('bucket_name')  # get bucket
for key in bucket.list(prefix='key'):  # list objects at a given prefix
    file_contents = key.get_contents_as_string()  # get file contents
    # count the newline characters to get the number of lines
    print(file_contents.count(b'\n'))

Reading a large file into memory is sometimes far from ideal. Instead, you may find the following more useful:

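import boto3
s3 = boto3.client('s3')
# fileKey holds the key of the object to read
obj = s3.get_object(Bucket='bucketname', Key=fileKey)

# stream the body line by line instead of loading it all at once
nlines = 0
for _ in obj['Body'].iter_lines():
    nlines += 1
print(nlines)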

Amazon S3 is only a storage service. You have to fetch a file before you can perform any operation on it (such as counting its lines).

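You can loop through the bucket with boto3's list_objects_v2. Because list_objects_v2 lists at most 1000 keys per call (even if you specify a larger MaxKeys), you must check whether NextContinuationToken is present in the response dict, and then pass it as ContinuationToken to read the next page.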
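The continuation-token loop described above could be sketched as follows. This is only a minimal illustration: the helper name `iter_keys` is hypothetical, and `client` is assumed to be an S3 client such as the one returned by `boto3.client('s3')`.

```python
def iter_keys(client, bucket):
    """Yield every key in `bucket`, one page of results at a time.

    `client` is assumed to behave like boto3.client('s3').
    """
    kwargs = {"Bucket": bucket}
    while True:
        response = client.list_objects_v2(**kwargs)
        # "Contents" is absent when a page (or the whole bucket) is empty
        for item in response.get("Contents", []):
            yield item["Key"]
        token = response.get("NextContinuationToken")
        if token is None:
            break  # no further pages
        # ask for the page that follows the one just read
        kwargs["ContinuationToken"] = token
```

With a real client, `for key in iter_keys(boto3.client('s3'), 'my-bucket-name')` would walk the whole bucket without holding the full key list in memory.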

I wrote sample code for this in another answer, but I don't remember which one.

Then use get_object() to read each file and apply a simple line-counting routine.

(Update) If you only need the keys under a particular prefix, add a Prefix filter.
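Putting the listing, the prefix filter, and the line count together might look like the sketch below. It swaps the manual token handling for boto3's built-in paginator (`get_paginator('list_objects_v2')`), which follows the continuation tokens for you. The function name `count_lines_by_key` is hypothetical, and `client` is again assumed to be something like `boto3.client('s3')`.

```python
def count_lines_by_key(client, bucket, prefix=""):
    """Return {key: line_count} for each object under `prefix` in `bucket`.

    `client` is assumed to behave like boto3.client('s3').
    """
    counts = {}
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for item in page.get("Contents", []):
            key = item["Key"]
            body = client.get_object(Bucket=bucket, Key=key)["Body"]
            # stream the object line by line rather than reading it whole
            counts[key] = sum(1 for _ in body.iter_lines())
    return counts
```

Note that iterating over lines also counts a final line that has no trailing newline, so the result can be one higher than counting newline characters in the same file.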
