How to yield multiple list objects by appending values



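I have the code below, which fetches resource information from the AWS service resourcegroupstaggingapi across all AWS-supported regions. I am iterating over multiple regions and fetching the records successfully, but processing takes a long time and uses a lot of CPU and memory; I have about 40 million records to process. Can anyone tell me the best way to optimize this code? I have read that generators improve execution speed, but I don't know how to append/yield multiple values. I am also new to Python; can someone guide me on how to improve the following code: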

import boto3, os, json
from credentials import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
AWS_SUPPORTED_REGIONS = ["ap-northeast-1", "ap-northeast-2", "ap-south-1", "ap-southeast-1", "ap-southeast-2",
                         "ca-central-1", "eu-central-1", "eu-north-1", "eu-west-1", "eu-west-2", "eu-west-3",
                         "sa-east-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2"]

def services_info():
    services_info = []
    services_info_no_owner = []
    for region in AWS_SUPPORTED_REGIONS:
        client = boto3.client('resourcegroupstaggingapi', region_name=region,
                              aws_access_key_id=AWS_ACCESS_KEY_ID,
                              aws_secret_access_key=AWS_SECRET_ACCESS_KEY
                              )
        paginator = client.get_paginator('get_resources')
        resources = []
        for page in paginator.paginate():
            resources.extend(page["ResourceTagMappingList"])
        for resource in resources:
            resource_arn = resource.get("ResourceARN")
            arn_split = resource_arn.split(':')
            service_name = arn_split[2]
            resource_owner_info = arn_split[3]
            services_info.append({
                "resource_arn": resource_arn,
                "service_name": service_name,
                "region": region,
                "owner_info": resource_owner_info
            })
            if services_info_no_owner.isspace():
                services_info_no_owner.append({
                    "resource_arn": resource_arn,
                    "service_name": service_name,
                    "region": region,
                    "owner_info": resource_owner_info
                })
    return services_info, services_info_no_owner

services_info, services_info_no_owner = services_info()
try:
    with open("services_info.json", 'w') as output:
        json.dump(services_info, output, sort_keys=True, indent=4)
except Exception as e:
    print("Exception occurred while writing to file")
try:
    with open("services_info_no_owner.json", 'w') as output:
        json.dump(services_info_no_owner, output, sort_keys=True, indent=4)
except Exception as e:
    print("Exception occurred while writing to file")

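  1. I removed the lines that define new variables just to hold intermediate values and put everything on one line instead, which should free up memory.
  2. I tried to convert your code to generators, since they are better optimized.
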
import boto3, os, json
from credentials import AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY
AWS_SUPPORTED_REGIONS = ["ap-northeast-1", "ap-northeast-2", "ap-south-1", "ap-southeast-1", "ap-southeast-2",
                         "ca-central-1", "eu-central-1", "eu-north-1", "eu-west-1", "eu-west-2", "eu-west-3",
                         "sa-east-1", "us-east-1", "us-east-2", "us-west-1", "us-west-2"]

def services_info():
    services_info = []
    services_info_no_owner = []
    def go(region):
        # Flatten the pages so that each item is one resource mapping, not a list of them.
        resources = [resource
                     for page in boto3.client('resourcegroupstaggingapi', region_name=region,
                                              aws_access_key_id=AWS_ACCESS_KEY_ID,
                                              aws_secret_access_key=AWS_SECRET_ACCESS_KEY
                                              ).get_paginator('get_resources').paginate()
                     for resource in page["ResourceTagMappingList"]]
        [(services_info.append({
            "resource_arn": resource.get("ResourceARN"),
            "service_name": resource.get("ResourceARN").split(':')[2],
            "region": region,
            "owner_info": resource.get("ResourceARN").split(':')[3]
        }), services_info_no_owner.append({
            "resource_arn": resource.get("ResourceARN"),
            "service_name": resource.get("ResourceARN").split(':')[2],
            "region": region,
            "owner_info": resource.get("ResourceARN").split(':')[3]
        }))
         # A list has no isspace(); testing the owner field instead is an assumed reading of the intent.
         if not resource.get("ResourceARN").split(':')[3].strip()
         else services_info.append({
            "resource_arn": resource.get("ResourceARN"),
            "service_name": resource.get("ResourceARN").split(':')[2],
            "region": region,
            "owner_info": resource.get("ResourceARN").split(':')[3]
        }) for resource in resources]
    list(map(lambda region: go(region), AWS_SUPPORTED_REGIONS))
    return services_info, services_info_no_owner

services_info, services_info_no_owner = services_info()
try:
    with open("services_info.json", 'w') as output:
        json.dump(services_info, output, sort_keys=True, indent=4)
except Exception as e:
    print("Exception occurred while writing to file")
try:
    with open("services_info_no_owner.json", 'w') as output:
        json.dump(services_info_no_owner, output, sort_keys=True, indent=4)
except Exception as e:
    print("Exception occurred while writing to file")
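
The code above still builds complete lists in memory, so it may help to see what an actual generator could look like. The following is a minimal sketch, not a definitive implementation: the function name `iter_records` and the sample pages are hypothetical, the pages are assumed to come from `paginator.paginate()` as in the question, and the blank-owner test on ARN field 3 is an assumed reading of what the question intended. It yields one `(record, has_owner)` pair per resource, which is how a single generator can feed more than one output.

```python
def iter_records(pages, region):
    """Yield (record, has_owner) for every resource in an iterable of pages."""
    for page in pages:
        for resource in page["ResourceTagMappingList"]:
            arn = resource["ResourceARN"]
            parts = arn.split(":")
            owner = parts[3]  # the field the question stores as owner_info
            record = {
                "resource_arn": arn,
                "service_name": parts[2],
                "region": region,
                "owner_info": owner,
            }
            # Yielding a tuple lets one generator hand back several values at once.
            yield record, bool(owner.strip())


# Hypothetical sample data standing in for paginator.paginate().
sample_pages = [
    {"ResourceTagMappingList": [
        {"ResourceARN": "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc"},
        {"ResourceARN": "arn:aws:s3:::example-bucket"},
    ]},
]

all_records, no_owner = [], []
for record, has_owner in iter_records(sample_pages, "us-east-1"):
    all_records.append(record)
    if not has_owner:
        no_owner.append(record)
```

The two lists are collected here only to keep the example short. With 40 million records, the consumer loop would more likely write each record to its output file as it arrives, so that only one record is held in memory at a time.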

First of all, the code does not appear to be correct, because isspace() will fail on the list services_info_no_owner with AttributeError: 'list' object has no attribute 'isspace'.
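
The failure is easy to reproduce in isolation. The sketch below also shows one possible replacement check; the helper name `owner_is_blank` is hypothetical, and it assumes the intent was to test whether the owner string of a resource is empty or whitespace.

```python
services_info_no_owner = []

# Lists do not define isspace(), so the check from the question raises.
try:
    services_info_no_owner.isspace()
    raised = False
except AttributeError:
    raised = True


def owner_is_blank(owner_info):
    """Return True when the owner string is empty or only whitespace."""
    # "".isspace() is False, so strip() is used to cover the empty string too.
    return not owner_info.strip()
```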

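One of the main reasons the code is slow is that it creates a dictionary for every item, which is much slower than using a list or tuple.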
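
One rough way to get a feel for the per-record overhead is to compare container sizes with `sys.getsizeof`. This is only a sketch: the sample ARN is made up, the exact byte counts depend on the Python version, and the figures cover the container alone, not the strings it refers to.

```python
import sys

arn = "arn:aws:ec2:us-east-1:123456789012:instance/i-0abc"  # made-up sample
parts = arn.split(":")

# The same record as a dictionary and as a plain tuple.
as_dict = {
    "resource_arn": arn,
    "service_name": parts[2],
    "region": "us-east-1",
    "owner_info": parts[3],
}
as_tuple = (arn, parts[2], "us-east-1", parts[3])

dict_size = sys.getsizeof(as_dict)
tuple_size = sys.getsizeof(as_tuple)
print(dict_size, tuple_size)
```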

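You are writing the column headers ("resource_arn", "service_name", "region", "owner_info") to the file 40 million times. Imagine the time and space spent writing 40 million * roughly 40 bytes = 1.6 billion bytes. JSON is therefore not the right format here. One suggestion is to use a pandas DataFrame and then call to_csv() to write a CSV file, or simply to use lists and write the CSV by hand. The main benefit is that you no longer have to append a dictionary for every item.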
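
As a sketch of the "write the CSV by hand" option, the standard-library `csv` module can stream rows to a file so that the header is written only once. The function name `write_rows` and the sample rows are hypothetical; an in-memory buffer stands in for the real output file.

```python
import csv
import io

HEADER = ["resource_arn", "service_name", "region", "owner_info"]


def write_rows(rows, output):
    """Write the header once, then one CSV line per row; return the row count."""
    writer = csv.writer(output)
    writer.writerow(HEADER)
    count = 0
    for row in rows:  # rows may be a generator, so nothing is held in memory
        writer.writerow(row)
        count += 1
    return count


# Made-up sample rows; in practice these would come from the AWS pages.
sample_rows = [
    ("arn:aws:ec2:us-east-1:123456789012:instance/i-0abc", "ec2", "us-east-1", "us-east-1"),
    ("arn:aws:s3:::example-bucket", "s3", "us-east-1", ""),
]

buffer = io.StringIO()
written = write_rows(iter(sample_rows), buffer)
lines = buffer.getvalue().splitlines()
```

For a real file, the buffer would be replaced by something like `open("services_info.csv", "w", newline="")`, which is the form the `csv` documentation recommends.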

With your existing code, you can refactor the first loop using a list comprehension. Replace

for page in paginator.paginate():
    resources.extend(page["ResourceTagMappingList"])

with

resources.extend([resource for page in paginator.paginate() for resource in page["ResourceTagMappingList"]])
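
When turning the loop into a comprehension, keep in mind that each page holds a list of its own, so the pages have to be flattened rather than collected as they are. The sketch below contrasts the two results using made-up sample pages in place of `paginator.paginate()`; `itertools.chain.from_iterable` is one standard way to flatten.

```python
from itertools import chain

# Made-up pages standing in for paginator.paginate().
sample_pages = [
    {"ResourceTagMappingList": [{"ResourceARN": "arn:aws:s3:::bucket-a"}]},
    {"ResourceTagMappingList": [{"ResourceARN": "arn:aws:s3:::bucket-b"},
                                {"ResourceARN": "arn:aws:s3:::bucket-c"}]},
]

# One entry per page: a list of lists, which the later loop cannot use directly.
nested = [page["ResourceTagMappingList"] for page in sample_pages]

# One entry per resource, equivalent to calling extend() page by page.
flat = list(chain.from_iterable(
    page["ResourceTagMappingList"] for page in sample_pages
))
```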

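Replace the second for loop as shown below, using comments to make up for the lost readability. service_name and resource_owner_info are already contained in your resource_arn, so there is no need to store them separately. The region is also part of the ARN, so it does not need to be stored either. Replace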

for resource in resources:
    resource_arn = resource.get("ResourceARN")
    arn_split = resource_arn.split(':')
    service_name = arn_split[2]
    resource_owner_info = arn_split[3]
    services_info.append({"resource_arn": resource_arn, "service_name": service_name, "region": region, "owner_info": resource_owner_info})

with

# extend() rather than assignment, so results from earlier regions are kept
services_info.extend([resource.get("ResourceARN") for resource in resources])
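
If only the ARN is stored, whoever reads the file can split it again when the individual fields are needed. The sketch below assumes the standard ARN layout `arn:partition:service:region:account-id:resource`; the names `ParsedArn` and `parse_arn` are hypothetical. Under that layout, index 3 of the split is the region and index 4 is the account ID, so the question's `owner_info` (index 3) may actually hold the region.

```python
from typing import NamedTuple


class ParsedArn(NamedTuple):
    partition: str
    service: str
    region: str
    account_id: str
    resource: str


def parse_arn(arn):
    """Split an ARN into its fields, keeping colons inside the resource part."""
    # maxsplit=5 leaves everything after the fifth colon in one piece.
    prefix, partition, service, region, account_id, resource = arn.split(":", 5)
    if prefix != "arn":
        raise ValueError("not an ARN: " + arn)
    return ParsedArn(partition, service, region, account_id, resource)


ec2 = parse_arn("arn:aws:ec2:us-east-1:123456789012:instance/i-0abc")
bucket = parse_arn("arn:aws:s3:::example-bucket")
function = parse_arn("arn:aws:lambda:us-east-1:123456789012:function:my-fn")
```

Global resources such as S3 buckets leave the region and account fields empty, which is worth keeping in mind when deciding what counts as "no owner".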

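I know that both suggestions above require coordinating with the consumers of the JSON file, but with 40 million records the gains are worth the effort.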
