AWS CloudWatch alarm that adds capacity to an EC2 Auto Scaling group stays in the ALARM state



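I set up a CloudWatch alarm that adds 1 unit of capacity to an EC2 Auto Scaling group when memory reservation is > 70%. The alarm fired at the right moment, but it has now been in the ALARM state for 16+ hours and nothing has changed in the EC2 Auto Scaling group. What could be wrong?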

Here is my ECS CloudFormation template:

ECSCluster:
  Type: AWS::ECS::Cluster
  Properties:
    ClusterName: !Ref EnvironmentName
ECSAutoScalingGroup:
  DependsOn: ECSCluster
  Type: AWS::AutoScaling::AutoScalingGroup
  Properties:
    VPCZoneIdentifier: !Ref Subnets
    LaunchConfigurationName: !Ref ECSLaunchConfiguration
    MinSize: !Ref ClusterMinSize
    MaxSize: !Ref ClusterMaxSize
    DesiredCapacity: !Ref ClusterDesiredCapacity
  CreationPolicy:
    ResourceSignal:
      Timeout: PT15M
  UpdatePolicy:
    AutoScalingRollingUpdate:
      MinInstancesInService: 1
      MaxBatchSize: 1
      PauseTime: PT15M
      SuspendProcesses:
        - HealthCheck
        - ReplaceUnhealthy
        - AZRebalance
        - AlarmNotification
        - ScheduledActions
      WaitOnResourceSignals: true
ScaleUpPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '1'
MemoryReservationAlarmHigh:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '70'
    AlarmDescription: Alarm if Cluster Memory Reservation is too high
    Period: '60'
    AlarmActions:
      - Ref: ScaleUpPolicy
    Namespace: AWS/ECS
    Dimensions:
      - Name: ClusterName
        Value: !Ref ECSCluster
    ComparisonOperator: GreaterThanThreshold
    MetricName: MemoryReservation
ScaleDownPolicy:
  Type: AWS::AutoScaling::ScalingPolicy
  Properties:
    AdjustmentType: ChangeInCapacity
    AutoScalingGroupName: !Ref ECSAutoScalingGroup
    Cooldown: '1'
    ScalingAdjustment: '-1'
MemoryReservationAlarmLow:
  Type: AWS::CloudWatch::Alarm
  Properties:
    EvaluationPeriods: '2'
    Statistic: Average
    Threshold: '30'
    AlarmDescription: Alarm if Cluster Memory Reservation is too Low
    Period: '60'
    AlarmActions:
      - Ref: ScaleDownPolicy
    Namespace: AWS/ECS
    Dimensions:
      - Name: ClusterName
        Value: !Ref ECSCluster
    ComparisonOperator: LessThanThreshold
    MetricName: MemoryReservation
ECSLaunchConfiguration:
  Type: AWS::AutoScaling::LaunchConfiguration
  Properties:
    KeyName: !If [IsProd, !Ref 'AWS::NoValue', !Ref KeyName]
    ImageId: !Ref ECSAMI
    InstanceType: !Ref InstanceType
    SecurityGroups:
      - !Ref SecurityGroup
    IamInstanceProfile: !Ref ECSInstanceProfile
    UserData:
      "Fn::Base64": !Sub |
        #!/bin/bash
        source /etc/profile.d/proxy.sh
        yum install -y https://s3.amazonaws.com/ec2-downloads-windows/SSMAgent/latest/linux_amd64/amazon-ssm-agent.rpm
        yum install -y https://s3.amazonaws.com/amazoncloudwatch-agent/amazon_linux/amd64/latest/amazon-cloudwatch-agent.rpm
        yum install -y aws-cfn-bootstrap hibagent
        cat >> /opt/aws/amazon-cloudwatch-agent/etc/common-config.toml <<EOF
        [proxy]
        http_proxy="${!http_proxy}"
        https_proxy="${!https_proxy}"
        no_proxy="${!no_proxy}"
        EOF
        /opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        /opt/aws/bin/cfn-signal -e $? --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSAutoScalingGroup
        /usr/bin/enable-ec2-spot-hibernation
  Metadata:
    AWS::CloudFormation::Init:
      config:
        packages:
          yum:
            collectd: []
        commands:
          01_add_instance_to_cluster:
            command: !Sub echo ECS_CLUSTER=${ECSCluster} >> /etc/ecs/ecs.config
          02_enable_cloudwatch_agent:
            command: !Sub /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c ssm:${ECSCloudWatchParameter} -s
        files:
          /etc/cfn/cfn-hup.conf:
            mode: 000400
            owner: root
            group: root
            content: !Sub |
              [main]
              stack=${AWS::StackId}
              region=${AWS::Region}
          /etc/cfn/hooks.d/cfn-auto-reloader.conf:
            content: !Sub |
              [cfn-auto-reloader-hook]
              triggers=post.update
              path=Resources.ECSLaunchConfiguration.Metadata.AWS::CloudFormation::Init
              action=/opt/aws/bin/cfn-init -v --region ${AWS::Region} --stack ${AWS::StackName} --resource ECSLaunchConfiguration
        services:
          sysvinit:
            cfn-hup:
              enabled: true
              ensureRunning: true
              files:
                - /etc/cfn/cfn-hup.conf
                - /etc/cfn/hooks.d/cfn-auto-reloader.conf
# This IAM Role is attached to all of the ECS hosts. It is based on the default role
# published here:
# http://docs.aws.amazon.com/AmazonECS/latest/developerguide/instance_IAM_role.html
#
# You can add other IAM policy statements here to allow access from your ECS hosts
# to other AWS services. Please note that this role will be used by ALL containers
# running on the ECS host.
ECSRole:
  Type: AWS::IAM::Role
  Properties:
    Path: /
    RoleName: !Sub ${EnvironmentName}-ECSRole-${AWS::Region}
    AssumeRolePolicyDocument: |
      {
        "Statement": [{
          "Action": "sts:AssumeRole",
          "Effect": "Allow",
          "Principal": {
            "Service": "ec2.amazonaws.com"
          }
        }]
      }
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
      - arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
      - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy
    Policies:
      - PolicyName: ecs-service
        PolicyDocument: |
          {
            "Statement": [{
              "Effect": "Allow",
              "Action": [
                "ecs:CreateCluster",
                "ecs:DeregisterContainerInstance",
                "ecs:DiscoverPollEndpoint",
                "ecs:Poll",
                "ecs:RegisterContainerInstance",
                "ecs:StartTelemetrySession",
                "ecs:Submit*",
                "ecr:BatchCheckLayerAvailability",
                "ecr:BatchGetImage",
                "ecr:GetDownloadUrlForLayer",
                "ecr:GetAuthorizationToken"
              ],
              "Resource": "*"
            }]
          }
ECSInstanceProfile:
  Type: AWS::IAM::InstanceProfile
  Properties:
    Path: /
    Roles:
      - !Ref ECSRole
ECSServiceAutoScalingRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        Action:
          - "sts:AssumeRole"
        Effect: Allow
        Principal:
          Service:
            - application-autoscaling.amazonaws.com
    Path: /
    ManagedPolicyArns:
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/CSOPSRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPIAMRestrictionPolicy"
      - !Sub "arn:aws:iam::${AWS::AccountId}:policy/HIPBasePolicy"
    Policies:
      - PolicyName: ecs-service-autoscaling
        PolicyDocument:
          Statement:
            Effect: Allow
            Action:
              - application-autoscaling:*
              - cloudwatch:DescribeAlarms
              - cloudwatch:PutMetricAlarm
              - ecs:DescribeServices
              - ecs:UpdateService
            Resource: "*"
ECSCloudWatchParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: CloudWatch Log configs for ECS cluster
    Name: !Sub AmazonCloudWatch-${ECSCluster}-ECS
    Type: String
    Value: !Sub |
      {
        "logs": {
          "force_flush_interval": 5,
          "logs_collected": {
            "files": {
              "collect_list": [
                {
                  "file_path": "/var/log/messages",
                  "log_group_name": "${ECSCluster}/var/log/messages",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%b %d %H:%M:%S"
                },
                {
                  "file_path": "/var/log/dmesg",
                  "log_group_name": "${ECSCluster}/var/log/dmesg",
                  "log_stream_name": "{instance_id}"
                },
                {
                  "file_path": "/var/log/docker",
                  "log_group_name": "${ECSCluster}/var/log/docker",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%S.%f"
                },
                {
                  "file_path": "/var/log/ecs/ecs-init.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-init.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/ecs-agent.log.*",
                  "log_group_name": "${ECSCluster}/var/log/ecs/ecs-agent.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                },
                {
                  "file_path": "/var/log/ecs/audit.log",
                  "log_group_name": "${ECSCluster}/var/log/ecs/audit.log",
                  "log_stream_name": "{instance_id}",
                  "timestamp_format": "%Y-%m-%dT%H:%M:%SZ"
                }
              ]
            }
          }
        },
        "metrics": {
          "append_dimensions": {
            "AutoScalingGroupName": "${!aws:AutoScalingGroupName}",
            "InstanceId": "${!aws:InstanceId}",
            "InstanceType": "${!aws:InstanceType}"
          },
          "metrics_collected": {
            "collectd": {
              "metrics_aggregation_interval": 60
            },
            "disk": {
              "measurement": [
                "used_percent"
              ],
              "metrics_collection_interval": 60,
              "resources": [
                "/"
              ]
            },
            "mem": {
              "measurement": [
                "mem_used_percent"
              ],
              "metrics_collection_interval": 60
            },
            "statsd": {
              "metrics_aggregation_interval": 60,
              "metrics_collection_interval": 10,
              "service_address": ":8125"
            }
          }
        }
      }
ECSClusterParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Cluster
    Name: !Sub /${EnvironmentName}/ecs-cluster
    Type: String
    Value: !Ref ECSCluster
ECSServiceAutoScalingRoleParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Description: !Sub ${EnvironmentName} - ECS Service ASG Role
    Name: !Sub /${EnvironmentName}/ecs-service-asg-role
    Type: String
    Value: !GetAtt ECSServiceAutoScalingRole.Arn

Alarm activity history:

2019-12-26 11:40:54 Action  Successfully executed action arn:aws:autoscaling:ap-southeast-2:031539715286:scalingPolicy:95e836b6-2f56-498d-b931-7ec4184bedc4:autoScalingGroupName/ECS-UEBZA8GAP8S7-ECSAutoScalingGroup-1BIBTJH5I50W9:policyName/ECS-UEBZA8GAP8S7-ScaleUpPolicy-17LUWE42DC7EO
2019-12-26 11:40:54 State update  Alarm updated from OK to In alarm

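Make sure no processes are suspended on the Auto Scaling group. If AlarmNotification is suspended, incoming alarms will not trigger the scaling policies. If Launch is suspended, nothing will be launched even when the desired capacity goes up.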
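
A minimal sketch of that check, assuming `group` is one entry of the `AutoScalingGroups` list returned by boto3's `describe_auto_scaling_groups` call. The helper name `blocking_suspensions` and the sample data are made up for illustration. The template in the question lists `AlarmNotification` under `SuspendProcesses` for rolling updates, so it may be worth ruling out that the process was left suspended.

```python
# Processes whose suspension would stop an alarm-driven scale-out.
BLOCKING = {"AlarmNotification", "Launch"}


def blocking_suspensions(group):
    """Return the suspended processes in `group` that would block scale-out.

    `group` is assumed to be one entry of the "AutoScalingGroups" list from
    boto3's autoscaling describe_auto_scaling_groups() response.
    """
    suspended = {p["ProcessName"] for p in group.get("SuspendedProcesses", [])}
    return sorted(suspended & BLOCKING)


# Hypothetical sample shaped like the API response (not real data).
sample_group = {
    "AutoScalingGroupName": "example-asg",
    "SuspendedProcesses": [
        {"ProcessName": "AlarmNotification", "SuspensionReason": "example"},
        {"ProcessName": "AZRebalance", "SuspensionReason": "example"},
    ],
}

print(blocking_suspensions(sample_group))
```

If the returned list is not empty, resuming those processes (for example with `aws autoscaling resume-processes`) would be the next thing to try.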

Other common problems that can cause this:

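  • If you are using instance weights and the policy increases the desired capacity by 1, but the lowest weight is not 1, the group might never be able to scale.

  • Make sure no other scaling policy is being triggered that could override this one.

  • Check the activity history to make sure health check replacements are not happening constantly. Each one starts a 5-minute cooldown (the default, because no cooldown is set on the ASG itself, only on the scaling policy), and that cooldown blocks simple scaling policies.

  • Make sure the desired capacity has not already reached the maximum.

  • Besides the alarm firing, make sure the alarm history shows that the Auto Scaling "Action" actually happened. (The action is executed every minute while the alarm stays in the ALARM state, regardless of your evaluation settings, but only the first execution is posted to the alarm history.)

  • Check the ASG activity history for launch failures. These are especially common with Spot instances, and after enough failures the ASG eventually goes into a backoff state. Any manual update to the group resets this backoff.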
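
Several of the checks above can be sketched in code. This is a rough illustration, not a complete diagnostic: it assumes the inputs are the dictionaries boto3 returns from `describe_auto_scaling_groups`, `describe_scaling_activities` and CloudWatch's `describe_alarm_history`. The helper names and the sample data are hypothetical.

```python
def at_max_capacity(group):
    """True if the group cannot grow because desired capacity reached MaxSize."""
    return group["DesiredCapacity"] >= group["MaxSize"]


def failed_activities(activities):
    """Return (description, status message) for each failed scaling activity."""
    return [
        (a.get("Description", ""), a.get("StatusMessage", ""))
        for a in activities
        if a.get("StatusCode") == "Failed"
    ]


def scaling_actions(history_items):
    """Return the alarm history entries that record an executed action."""
    return [h for h in history_items if h.get("HistoryItemType") == "Action"]


# Hypothetical samples shaped like the API responses (not real data).
sample_group = {"MinSize": 1, "MaxSize": 3, "DesiredCapacity": 3}
sample_activities = [
    {"Description": "Launching a new EC2 instance",
     "StatusCode": "Failed",
     "StatusMessage": "example failure message"},
    {"Description": "Launching a new EC2 instance",
     "StatusCode": "Successful"},
]
sample_history = [
    {"HistoryItemType": "StateUpdate", "HistorySummary": "example state change"},
    {"HistoryItemType": "Action", "HistorySummary": "example executed action"},
]

print("at max:", at_max_capacity(sample_group))
print("failed launches:", failed_activities(sample_activities))
print("actions recorded:", len(scaling_actions(sample_history)))
```

A group at its maximum, a run of failed activities, or an alarm history with no "Action" entries would each point to a different item in the list.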

Did you specify "ActionsEnabled=True"?
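
One way to confirm this on the deployed alarm, as a sketch: it assumes `alarm` is one entry of the `MetricAlarms` list returned by boto3's CloudWatch `describe_alarms` call, and the helper name `alarm_can_act` and the sample values are hypothetical. As far as I know, `ActionsEnabled` defaults to true on `AWS::CloudWatch::Alarm`, so a template that omits it should not be affected unless the setting was changed afterwards.

```python
def alarm_can_act(alarm):
    """True if the alarm has actions enabled and at least one ALARM action."""
    return bool(alarm.get("ActionsEnabled")) and bool(alarm.get("AlarmActions"))


# Hypothetical sample shaped like one "MetricAlarms" entry (not real data).
sample_alarm = {
    "AlarmName": "example-memory-reservation-high",
    "ActionsEnabled": True,
    "AlarmActions": ["arn:aws:autoscaling:example-scaling-policy-arn"],
    "StateValue": "ALARM",
}

print(alarm_can_act(sample_alarm))
```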
