How to optimize Nextflow on AWS Batch (SPOT)?



I am running a Nextflow pipeline on AWS Batch, configured with:

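  • A managed compute environment
  • SPOT instances
  • The SPOT_CAPACITY_OPTIMIZED allocation strategy
  • Instances allowed from the general-purpose or memory-optimized families (e.g. r4, r5, r6i, m4, m5, etc.), in sizes from .xlarge to .8xlarge
  • A relatively large maximum vCPU count (128 or 256)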

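As an example, after I launched the pipeline it submitted 3 jobs, each defined with CPUs 2 and memory 8 GB, but AWS Batch deployed much larger instances for these 3 jobs (e.g. r6i.8xlarge, so neither memory nor CPU is a bottleneck), and their utilization stays at maybe 20% the whole time.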

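How can I set this up so that the deployed instances are not permanently underutilized?
I tried allowing smaller instance types, but then the jobs got stuck in RUNNABLE and did not move for hours.

I also tried running the pipeline with various --max_cpus and --max_memory values, but saw no effect. What am I doing wrong?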

Edit:

As suggested, I set up three CEs and attached them to the job queue in the following order (see the configuration below):

  • Instances up to .2xlarge, max 64 vCPUs
  • Instances up to .8xlarge, max 128 vCPUs
  • optimal, max 256 vCPUs

I now have 3 jobs stuck in RUNNABLE, each allocated vCPUs 6 and Memory 36864:

{
"computeEnvironments": [
{
"computeEnvironmentName": "ce-spot-optimal-spot-capacity-3",
"computeEnvironmentArn": "arn:aws:batch:ap-southeast-1:088159696610:compute-environment/ce-spot-optimal-spot-capacity-3",
"ecsClusterArn": "arn:aws:ecs:ap-southeast-1:088159696610:cluster/AWSBatch-ce-spot-optimal-spot-capacity-3-dbc12b72-6260-315e-a73c-4169455d2a70",
"tags": {},
"type": "MANAGED",
"state": "ENABLED",
"status": "VALID",
"statusReason": "ComputeEnvironment Healthy",
"computeResources": {
"type": "SPOT",
"allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
"minvCpus": 0,
"maxvCpus": 64,
"desiredvCpus": 24,
"instanceTypes": [
"m4.2xlarge",
"m4.large",
"m4.xlarge",
"m5.2xlarge",
"m5.large",
"m5.xlarge",
"r5.2xlarge",
"r5.large",
"r5.xlarge",
"r6i.2xlarge",
"r6i.large",
"r6i.xlarge"
],
"subnets": [
"subnet-7d67d035",
"subnet-2912954f",
"subnet-c9a4d690"
],
"securityGroupIds": [
"sg-a5c3b2e4"
],
"instanceRole": "arn:aws:iam::088159696610:instance-profile/BM-BatchCEInstanceRole",
"tags": {},
"bidPercentage": 30,
"launchTemplate": {
"launchTemplateName": "increase-volume",
"version": "1"
},
"ec2Configuration": [
{
"imageType": "ECS_AL2",
"imageIdOverride": "ami-0f8ea3f9358cddf80"
}
]
},
"serviceRole": "arn:aws:iam::088159696610:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch",
"updatePolicy": {
"terminateJobsOnUpdate": false,
"jobExecutionTimeoutMinutes": 30
},
"containerOrchestrationType": "ECS",
"uuid": "5b44dea7-f980-3cd7-92dc-2dc64d0c821c"
},
{
"computeEnvironmentName": "ce-spot-optimal-spot-capacity-2",
"computeEnvironmentArn": "arn:aws:batch:ap-southeast-1:088159696610:compute-environment/ce-spot-optimal-spot-capacity-2",
"ecsClusterArn": "arn:aws:ecs:ap-southeast-1:088159696610:cluster/AWSBatch-ce-spot-optimal-spot-capacity-2-ea6d28fd-495f-34bb-8ea2-1577fc961cf1",
"tags": {},
"type": "MANAGED",
"state": "ENABLED",
"status": "VALID",
"statusReason": "ComputeEnvironment Healthy",
"computeResources": {
"type": "SPOT",
"allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
"minvCpus": 0,
"maxvCpus": 128,
"desiredvCpus": 0,
"instanceTypes": [
"m4.2xlarge",
"m4.4xlarge",
"m4.large",
"m5.2xlarge",
"m5.4xlarge",
"m5.8xlarge",
"m5.large",
"m5.xlarge",
"r5.2xlarge",
"r5.4xlarge",
"r5.8xlarge",
"r5.large",
"r6i.2xlarge",
"r6i.4xlarge",
"r6i.8xlarge",
"r6i.large",
"m4.xlarge"
],
"subnets": [
"subnet-7d67d035",
"subnet-2912954f",
"subnet-c9a4d690"
],
"securityGroupIds": [
"sg-a5c3b2e4"
],
"instanceRole": "arn:aws:iam::088159696610:instance-profile/BM-BatchCEInstanceRole",
"tags": {},
"bidPercentage": 30,
"launchTemplate": {
"launchTemplateName": "increase-volume",
"version": "1"
},
"ec2Configuration": [
{
"imageType": "ECS_AL2",
"imageIdOverride": "ami-0f8ea3f9358cddf80"
}
]
},
"serviceRole": "arn:aws:iam::088159696610:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch",
"updatePolicy": {
"terminateJobsOnUpdate": false,
"jobExecutionTimeoutMinutes": 30
},
"containerOrchestrationType": "ECS",
"uuid": "c331302a-8830-3b58-a914-dc54129e2a35"
},
{
"computeEnvironmentName": "ce-spot-optimal-spot-capacity-1",
"computeEnvironmentArn": "arn:aws:batch:ap-southeast-1:088159696610:compute-environment/ce-spot-optimal-spot-capacity-1",
"ecsClusterArn": "arn:aws:ecs:ap-southeast-1:088159696610:cluster/AWSBatch-ce-spot-optimal-spot-capacity-1-6d15c4c4-8f8f-3081-b6af-38f5dfc47fed",
"tags": {},
"type": "MANAGED",
"state": "ENABLED",
"status": "VALID",
"statusReason": "ComputeEnvironment Healthy",
"computeResources": {
"type": "SPOT",
"allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
"minvCpus": 0,
"maxvCpus": 256,
"desiredvCpus": 0,
"instanceTypes": [
"optimal"
],
"subnets": [
"subnet-7d67d035",
"subnet-2912954f",
"subnet-c9a4d690"
],
"securityGroupIds": [
"sg-a5c3b2e4"
],
"instanceRole": "arn:aws:iam::088159696610:instance-profile/BM-BatchCEInstanceRole",
"tags": {},
"bidPercentage": 30,
"launchTemplate": {
"launchTemplateName": "increase-volume",
"version": "1"
},
"ec2Configuration": [
{
"imageType": "ECS_AL2",
"imageIdOverride": "ami-0f8ea3f9358cddf80"
}
]
},
"serviceRole": "arn:aws:iam::088159696610:role/aws-service-role/batch.amazonaws.com/AWSServiceRoleForBatch",
"updatePolicy": {
"terminateJobsOnUpdate": false,
"jobExecutionTimeoutMinutes": 30
},
"containerOrchestrationType": "ECS",
"uuid": "9a9c493b-4eec-3820-87a8-b86b93ab9341"
}
]
}

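Check the order of the compute environments in your job queue. The scheduler uses the order of the associated compute environments to decide where each job runs. So, to make sure smaller jobs land on more appropriately sized instances, list the best-suited compute environments first, in ascending order of size.

Otherwise, I think a separate queue for the smaller jobs might be what is needed. A separate queue lets you map up to three compute environments that are better suited to those jobs. You can then assign that job queue to the jobs with the queue directive. This can of course be done in nextflow.config using one or more process selectors.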
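
As a rough sketch of the second suggestion, the nextflow.config below routes selected processes to a dedicated queue. The queue names (default-queue, small-jobs-queue), the label small, and the process name FASTQC are all hypothetical placeholders, not taken from the question. Substitute the job queues you actually created in AWS Batch and the labels or process names your pipeline actually defines.

```groovy
// nextflow.config (illustrative sketch; queue, label and process names are assumptions)
aws.region = 'ap-southeast-1'   // region taken from the ARNs in the question

process {
    executor = 'awsbatch'

    // Default job queue for every process not matched by a selector below.
    queue = 'default-queue'

    // Processes declaring `label 'small'` are sent to a queue whose
    // compute environments only allow smaller instance types.
    withLabel: 'small' {
        queue = 'small-jobs-queue'
    }

    // A single process can also be selected by name.
    withName: 'FASTQC' {
        queue = 'small-jobs-queue'
    }
}
```

With this kind of layout the large-instance compute environments stay attached only to the default queue, so small jobs cannot trigger them. Whether that resolves the RUNNABLE jobs still depends on the small queue's compute environments offering an instance type that fits each job's vCPU and memory request.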
