Why won't my AWS ECS service start my task?



I'm running into a problem creating a new AWS load balancer plus an ECR repository, ECS cluster, and task in AWS using Terraform. Everything gets created without any errors. Some IAM roles and a certificate live in a separate file; the relevant definitions are below. What happens is that the ECS service creates a task, but the task shuts down immediately after starting. I see no logs at all in the CloudWatch log group; in fact, the log group is never even created.

When I first bring up the infrastructure this all fails, which makes sense to me, since the ECR repository is brand new and no Docker image has been pushed to it yet. But I then pushed an image, and the service still never starts. I assumed it would loop indefinitely, retrying the task after each failure, but that is not what happens.

I have forced it to restart by destroying the service and recreating it. Given that there is now an image to run, I expected that to work. It shows the same behavior as the initial launch: the service creates a task that fails to start, with no log explaining why, and then never runs a task again.

Does anyone know what's going on here, or where I might be able to see an error?

locals {
  container_name = "tdweb-web-server-container"
}

resource "aws_lb" "web_server" {
  name               = "tdweb-alb"
  internal           = false
  load_balancer_type = "application"
  security_groups    = [aws_security_group.lb_sg.id]
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}
resource "aws_security_group" "lb_sg" {
name = "ALB Security Group"
description = "Allows TLS inbound traffic"
vpc_id = aws_vpc.main.id
ingress {
description = "TLS from VPC"
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_security_group" "web_server_service" {
name = "Web Sever Service Security Group"
description = "Allows HTTP inbound traffic"
vpc_id = aws_vpc.main.id
ingress {
description = "HTTP from VPC"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
}
resource "aws_alb_listener" "https" {
load_balancer_arn = aws_lb.web_server.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-2016-08"
certificate_arn = aws_acm_certificate.main.arn

default_action {
target_group_arn = aws_lb_target_group.web_server.arn
type = "forward"
}
}
resource "random_string" "target_group_suffix" {
length  = 4
upper   = false
special = false
}
resource "aws_lb_target_group" "web_server" {
name = "web-server-target-group-${random_string.target_group_suffix.result}"
port = 80
protocol = "HTTP"  
target_type = "ip"
vpc_id = aws_vpc.main.id
lifecycle {
create_before_destroy = true
}
}
resource "aws_iam_role" "web_server_task" {
name = "tdweb-web-server-task-role"
assume_role_policy = data.aws_iam_policy_document.web_server_task.json
}
data "aws_iam_policy_document" "web_server_task" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ecs-tasks.amazonaws.com"]
}
}
}
resource "aws_iam_role_policy_attachment" "web_server_task" {
for_each = toset([
"arn:aws:iam::aws:policy/AmazonSQSFullAccess",
"arn:aws:iam::aws:policy/AmazonS3FullAccess",
"arn:aws:iam::aws:policy/AmazonDynamoDBFullAccess",
"arn:aws:iam::aws:policy/AWSLambdaInvocation-DynamoDB"
])
role = aws_iam_role.web_server_task.name
policy_arn = each.value
}
resource "aws_ecr_repository" "web_server" {
name = "tdweb-web-server-repository"
}
resource "aws_ecs_cluster" "web_server" {
name = "tdweb-web-server-cluster"
}
resource "aws_ecs_task_definition" "web_server" {
family = "task_definition_name"
task_role_arn = aws_iam_role.web_server_task.arn
execution_role_arn = aws_iam_role.ecs_task_execution.arn
network_mode = "awsvpc"
cpu = "1024"
memory = "2048"
requires_compatibilities = ["FARGATE"]
container_definitions = <<DEFINITION
[
{
"name": "${local.container_name}",
"image": "${aws_ecr_repository.web_server.repository_url}:latest",
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/tdweb-task",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"portMappings": [
{
"hostPort": 80,
"protocol": "tcp",
"containerPort": 80
}
],
"cpu": 0,
"essential": true
}
]
DEFINITION
}
resource "aws_ecs_service" "web_server" {
name = "tdweb-web-server-service"
cluster = aws_ecs_cluster.web_server.id
launch_type = "FARGATE"
task_definition = aws_ecs_task_definition.web_server.arn
desired_count = 1
load_balancer {
target_group_arn = aws_lb_target_group.web_server.arn
container_name = local.container_name
container_port = 80
}
network_configuration {
subnets = [
aws_subnet.subnet_a.id,
aws_subnet.subnet_b.id,
aws_subnet.subnet_c.id
]
assign_public_ip = true
security_groups = [aws_security_group.web_server_service.id]
}
}
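One note on the logConfiguration in the task definition above: the awslogs log driver does not create the log group by default (there is an awslogs-create-group option, but it is not set here), so unless /ecs/tdweb-task is defined somewhere else, no log group and no logs will ever appear. A minimal sketch that provisions it explicitly, reusing the name from the task definition and assuming an arbitrary 30-day retention:

# Log group referenced by the task definition's awslogs options.
# The awslogs driver will not create this group on its own.
resource "aws_cloudwatch_log_group" "web_server" {
  name              = "/ecs/tdweb-task"
  retention_in_days = 30 # assumed value; adjust as needed
}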

Edit: to answer a comment, here are the VPC and subnets:

resource "aws_vpc" "main" {
cidr_block = "172.31.0.0/16"
}
resource "aws_subnet" "subnet_a" {
vpc_id     = aws_vpc.main.id
availability_zone = "us-east-1a"
cidr_block = "172.31.0.0/20"
}
resource "aws_subnet" "subnet_b" {
vpc_id     = aws_vpc.main.id
availability_zone = "us-east-1b"
cidr_block = "172.31.16.0/20"
}
resource "aws_subnet" "subnet_c" {
vpc_id     = aws_vpc.main.id
availability_zone = "us-east-1c"
cidr_block = "172.31.32.0/20"
}
resource "aws_internet_gateway" "main" {
vpc_id = aws_vpc.main.id
}

Edit: here is a somewhat enlightening update. I found that the error is reported not in the task logs but in the logs of the container inside the task; I never knew to look there:

Status reason: CannotPullContainerError: Error response from daemon: Get https://563407091361.dkr.ecr.us-east-1.amazonaws.com/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

It looks like the service cannot pull the container image from the ECR repository. After some reading I still don't know how to fix this; I'm still looking around.

Per the comments, one likely problem is that the subnets have no route to the internet. This can be corrected as follows:

# Route table to connect to the Internet Gateway
resource "aws_route_table" "public" {
  vpc_id = aws_vpc.main.id

  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.main.id
  }
}

resource "aws_route_table_association" "subnet_public_a" {
  subnet_id      = aws_subnet.subnet_a.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_b" {
  subnet_id      = aws_subnet.subnet_b.id
  route_table_id = aws_route_table.public.id
}

resource "aws_route_table_association" "subnet_public_c" {
  subnet_id      = aws_subnet.subnet_c.id
  route_table_id = aws_route_table.public.id
}

You can also add a depends_on to the aws_ecs_service so that the service waits for these route table associations to exist before it tries to launch tasks.
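A sketch of the service with that dependency added (everything except the depends_on block is unchanged from the definition in the question):

resource "aws_ecs_service" "web_server" {
  name            = "tdweb-web-server-service"
  cluster         = aws_ecs_cluster.web_server.id
  launch_type     = "FARGATE"
  task_definition = aws_ecs_task_definition.web_server.arn
  desired_count   = 1

  load_balancer {
    target_group_arn = aws_lb_target_group.web_server.arn
    container_name   = local.container_name
    container_port   = 80
  }

  network_configuration {
    subnets = [
      aws_subnet.subnet_a.id,
      aws_subnet.subnet_b.id,
      aws_subnet.subnet_c.id
    ]
    assign_public_ip = true
    security_groups  = [aws_security_group.web_server_service.id]
  }

  # Wait for the public routes before ECS tries to pull the image from ECR.
  depends_on = [
    aws_route_table_association.subnet_public_a,
    aws_route_table_association.subnet_public_b,
    aws_route_table_association.subnet_public_c,
  ]
}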

A shorter alternative to the three separate associations:

locals {
  subnets = [
    aws_subnet.subnet_a.id,
    aws_subnet.subnet_b.id,
    aws_subnet.subnet_c.id
  ]
}

resource "aws_route_table_association" "subnet_public" {
  count = length(local.subnets)

  subnet_id      = local.subnets[count.index]
  route_table_id = aws_route_table.public.id
}
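If the list of subnets might grow or change order, a for_each keyed by a static label avoids the re-association churn that shifting count indices can cause. A sketch, with a made-up local name and keys:

locals {
  public_subnet_ids = {
    a = aws_subnet.subnet_a.id
    b = aws_subnet.subnet_b.id
    c = aws_subnet.subnet_c.id
  }
}

resource "aws_route_table_association" "subnet_public" {
  # Keys are static strings, so they are known at plan time even though
  # the subnet IDs themselves are not known until apply.
  for_each = local.public_subnet_ids

  subnet_id      = each.value
  route_table_id = aws_route_table.public.id
}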
