AWS ParallelCluster compute nodes fail to start properly



I am a new ParallelCluster 2.11 user. My compute nodes fail to start properly, which causes pcluster create to eventually fail. Here is my config file:

[aws]
aws_region_name = us-east-1
[aliases]
ssh = ssh {CFN_USER}@{MASTER_IP} {ARGS}
[global]
cluster_template = default
update_check = true
sanity_check = true
[cluster default]
key_name = <keypair>
scheduler = slurm
master_instance_type = c5n.2xlarge
base_os = centos7
vpc_settings = default
queue_settings = compute
master_root_volume_size = 1000
compute_root_volume_size = 35
[vpc default]
vpc_id = <my-default-vpc>
master_subnet_id = <my-subnetc>
compute_subnet_id = <my-subnetb>
use_public_ips = false
[queue compute]
enable_efa = true
compute_resource_settings = default
compute_type = ondemand
placement_group = DYNAMIC
disable_hyperthreading = true
[compute_resource default]
instance_type = c5n.18xlarge
initial_count = 1
min_count = 1
max_count = 32
[ebs shared]
shared_dir = shared
volume_type = st1
volume_size = 500

When I run pcluster create, I get the following error after about 15 minutes:

The following resource(s) failed to create: [MasterServer]. 
- AWS::EC2::Instance MasterServer Failed to receive 1 resource signal(s) within the specified duration

If I log in to the master node before the failure above occurs, I see the following in the /var/log/parallelcluster/clustermgtd log file:

2021-09-28 15:42:41,168 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Found the following unhealthy static nodes: (x1) ['compute-st-c5n18xlarge-1(compute-st-c5n18xlarge-1)']
2021-09-28 15:42:41,168 - [slurm_plugin.clustermgtd:_handle_unhealthy_static_nodes] - INFO - Setting unhealthy static nodes to DOWN

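However, even though the node is set to DOWN, the EC2 compute instance stays in the running state, and the same log keeps emitting the following message: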

2021-09-28 15:54:41,156 - [slurm_plugin.clustermgtd:_maintain_nodes] - INFO - Following nodes are currently in replacement: (x1) ['compute-st-c5n18xlarge-1']

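This state persists until the pcluster create command fails with the error above. I suspect something is wrong with my configuration. Any help or further troubleshooting suggestions would be greatly appreciated.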
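For anyone debugging a similar loop, here is a minimal Python sketch that tallies how often each node shows up in the two kinds of clustermgtd messages quoted above. The function name `summarize` and the regular expressions are my own and are based only on the three log lines in this post, so other message formats are not covered.

```python
import re
from collections import Counter

# Matches the two message kinds quoted in this post, e.g.
# "... unhealthy static nodes: (x1) ['node(node)']"
# "... currently in replacement: (x1) ['node']"
NODE_LIST = re.compile(
    r"(unhealthy static nodes|currently in replacement): \(x\d+\) \[(.*)\]"
)
# A node name is the run of word characters and hyphens right after a quote.
NODE_NAME = re.compile(r"'([\w-]+)")


def summarize(lines):
    """Count, per message kind, how many times each node name is reported."""
    counts = {
        "unhealthy static nodes": Counter(),
        "currently in replacement": Counter(),
    }
    for line in lines:
        match = NODE_LIST.search(line)
        if match is None:
            continue  # unrelated log line
        kind, raw_nodes = match.groups()
        for name in NODE_NAME.findall(raw_nodes):
            counts[kind][name] += 1
    return counts
```

On the head node this could be fed with `summarize(open("/var/log/parallelcluster/clustermgtd"))`. A node that keeps accumulating "currently in replacement" entries without ever becoming healthy matches the symptom described here.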

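Is it possible to set up the cluster without the min_count parameter in the config file? That is, to instruct ParallelCluster to create the cluster without launching any compute nodes.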
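As a rough way to see how many nodes a config asks for at creation time, the sketch below reads the [compute_resource ...] sections with Python's standard `configparser`. It assumes that an omitted initial_count or min_count is treated as 0, which is my reading of the 2.x defaults and should be checked against the documentation. The helper name `static_node_counts` is hypothetical.

```python
import configparser


def static_node_counts(config_text):
    """Return {compute_resource_name: (initial_count, min_count)}."""
    # interpolation=None keeps values such as "{CFN_USER}" untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    counts = {}
    for section in parser.sections():
        if not section.startswith("compute_resource "):
            continue
        name = section.split(" ", 1)[1]
        # Assumption: a missing key means 0 (believed to be the 2.x default).
        initial = parser.getint(section, "initial_count", fallback=0)
        minimum = parser.getint(section, "min_count", fallback=0)
        counts[name] = (initial, minimum)
    return counts
```

With the config from the question this returns `{"default": (1, 1)}`, meaning one static node is requested. Under the stated assumption, setting both values to 0 (or omitting them) would let the cluster be created with no compute nodes running.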

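I was initially using two public subnets: one for the head node and one for the compute nodes. Switching the compute nodes to a private subnet solved the problem. Alternatively, not specifying a compute subnet and setting use_public_ips to true also solves the problem.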
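If it is unclear whether a given subnet is public or private, one common heuristic is to look for a route to an internet gateway in the subnet's route table. The sketch below applies that check to a dictionary shaped like the response of boto3's `describe_route_tables`. The function name is hypothetical, and the check is an approximation: a subnet with no explicit route table association falls back to the VPC's main route table, which this filter would not return.

```python
def has_internet_gateway_route(route_tables_response):
    """True if any route in the response targets an internet gateway (igw-...)."""
    for table in route_tables_response.get("RouteTables", []):
        for route in table.get("Routes", []):
            # Local routes carry GatewayId == "local"; NAT routes use NatGatewayId.
            if route.get("GatewayId", "").startswith("igw-"):
                return True
    return False


# Fetching the response requires AWS credentials, so it is shown as a comment:
#
#   import boto3
#   ec2 = boto3.client("ec2", region_name="us-east-1")
#   response = ec2.describe_route_tables(
#       Filters=[{"Name": "association.subnet-id", "Values": ["<my-subnetb>"]}]
#   )
#   print(has_internet_gateway_route(response))
```

A result of `True` for the compute subnet would correspond to the public-subnet setup that failed for me.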

After these steps, the compute nodes started successfully and I was able to run my jobs through Slurm.
