Amazon ECS: Container Orchestration Without the Kubernetes Learning Curve

Running Docker containers in production requires more than docker run. You need to schedule containers across multiple hosts, handle failures by restarting crashed containers, distribute traffic with a load balancer, and manage secrets and configuration. This is container orchestration.

ECS provides all of this with less operational complexity than Kubernetes. There is no control plane to manage, no etcd to back up, and no CNI plugin to troubleshoot. You define tasks; ECS places and runs them.

ECS Core Concepts

┌──────────────────────────────────────────────────────────────────┐
│                    ECS Architecture                              │
│                                                                  │
│  Cluster                                                         │
│  └── Service (desired count, deployment config, LB attachment)  │
│       └── Task Definition (container image, CPU, memory, ports) │
│            └── Task (running instance of task definition)       │
│                 └── Container(s)                                 │
│                                                                  │
│  Launch Type:                                                    │
│    EC2     → Task runs on EC2 instances you manage              │
│    Fargate → Task runs on AWS-managed compute (serverless)      │
└──────────────────────────────────────────────────────────────────┘

Cluster: A logical grouping for services and tasks. If you use the EC2 launch type, the cluster also contains the EC2 instances (called container instances).

Task Definition: A JSON document that specifies how one or more containers should run — which image, how much CPU and memory, which ports to expose, what environment variables to set, and which IAM role the containers can use.

Task: A running instance of a task definition. A task may contain multiple containers (a main container and sidecars).

Service: Ensures a specified number of tasks run at all times. If a task crashes, the service scheduler starts a replacement. Services integrate with load balancers for traffic distribution.
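The way a service holds its desired count can be pictured as a reconciliation loop. The sketch below is a toy in-memory model, not the ECS scheduler: the Service and Task classes and the reconcile method are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Task:
    task_id: int
    revision: str          # task definition revision, e.g. "web-api:3"
    running: bool = True


@dataclass
class Service:
    task_definition: str
    desired_count: int
    tasks: list = field(default_factory=list)
    _ids: count = field(default_factory=count, repr=False)

    def reconcile(self) -> int:
        """Drop stopped tasks, start replacements, return how many were started."""
        self.tasks = [t for t in self.tasks if t.running]
        missing = max(self.desired_count - len(self.tasks), 0)
        for _ in range(missing):
            self.tasks.append(Task(next(self._ids), self.task_definition))
        return missing


service = Service("web-api:3", desired_count=3)
service.reconcile()                  # initial placement: starts three tasks
service.tasks[0].running = False     # simulate a crashed task
print(service.reconcile(), len(service.tasks))
```

After the simulated crash, one replacement is started and the service is back at three tasks. The real scheduler also handles placement, health checks, and deployments, which this model leaves out.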

Task Definitions

A task definition is the blueprint ECS uses to launch tasks. Here is a practical example:

{
  "family": "web-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::123456789012:role/web-api-task-role",
  "containerDefinitions": [
    {
      "name": "api",
      "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web-api:v1.2.0",
      "portMappings": [{"containerPort": 8080, "protocol": "tcp"}],
      "environment": [
        {"name": "NODE_ENV", "value": "production"},
        {"name": "PORT", "value": "8080"}
      ],
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/db-password"
        }
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/web-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "api"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 15,
        "timeout": 5,
        "retries": 3
      }
    }
  ]
}

The executionRoleArn gives ECS permission to pull the container image from ECR, send logs to CloudWatch, and fetch the values referenced in the secrets block. The taskRoleArn gives the container itself permission to call AWS APIs (like S3 or DynamoDB).

aws ecs register-task-definition \
  --cli-input-json file://task-definition.json
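Fargate accepts only certain task-level cpu and memory pairs, so it can help to check a task definition before registering it. The following is a rough pre-flight check, assuming the size table below still matches the current AWS documentation (it covers sizes up to 4 vCPU only). check_fargate_size is a hypothetical helper, not part of any AWS SDK.

```python
# Task-level CPU units -> allowed memory values in MiB (sizes above 4 vCPU omitted).
FARGATE_MEMORY_MIB = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
}


def check_fargate_size(task_def: dict) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    cpu = int(task_def["cpu"])
    memory = int(task_def["memory"])
    allowed = FARGATE_MEMORY_MIB.get(cpu)
    if allowed is None:
        problems.append(f"cpu={cpu} is not in the table")
    elif memory not in allowed:
        problems.append(f"memory={memory} is not valid with cpu={cpu}")
    if task_def.get("networkMode") != "awsvpc":
        problems.append("Fargate tasks need networkMode=awsvpc")
    return problems


example = {"cpu": "512", "memory": "1024", "networkMode": "awsvpc"}
print(check_fargate_size(example))
```

In practice the dictionary would come from json.load on task-definition.json. The web-api definition above (512 CPU units, 1024 MiB, awsvpc) passes with no problems reported.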

Launch Types: EC2 vs Fargate

EC2 Launch Type

You provision EC2 instances and register them with the cluster through the ECS container agent, which comes preinstalled on the ECS-optimized AMIs and must be installed manually on other AMIs. ECS schedules tasks onto these instances based on available CPU and memory.

Advantages:

Full control over instance type, AMI, kernel parameters
Can use GPU instances or storage-optimised instances
Spot Instances can significantly reduce cost
Predictable pricing for stable workloads

Disadvantages:

You manage the EC2 fleet (patching, scaling, right-sizing)
Must plan capacity so instances are neither over- nor under-utilised

Fargate Launch Type

There are no EC2 instances to manage. You define CPU and memory at the task level, and AWS provisions the compute behind the scenes.

Advantages:

Zero infrastructure management
Task isolation — each task gets its own kernel (Firecracker microVM)
Scale to zero — no idle EC2 instances to pay for

Disadvantages:

More expensive per unit of compute than EC2 instances at consistent load
No GPU support (use EC2 launch type for GPU workloads)
Slower task startup than on EC2 instances that are already running and have the image cached

Creating a Service

# Create the cluster
aws ecs create-cluster --cluster-name production

# Create the service
aws ecs create-service \
  --cluster production \
  --service-name web-api \
  --task-definition web-api:3 \
  --desired-count 3 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={
    subnets=[subnet-0a1b2c,subnet-0d4e5f],
    securityGroups=[sg-api-tasks],
    assignPublicIp=DISABLED
  }" \
  --load-balancers "targetGroupArn=arn:aws:elasticloadbalancing:...,
    containerName=api,containerPort=8080" \
  --health-check-grace-period-seconds 30

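The --health-check-grace-period-seconds option tells the service scheduler to ignore failing health checks for that many seconds after a task starts, so slow-starting containers are not replaced before they are ready to serve traffic.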
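The effect of the grace period can be modelled in a few lines. This is an illustrative sketch of the rule only, not how ECS is implemented; counts_as_unhealthy and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def counts_as_unhealthy(started_at: datetime, now: datetime,
                        check_passed: bool, grace_seconds: int) -> bool:
    """Return True only when a failed check should count against the task."""
    if check_passed:
        return False
    in_grace_period = now - started_at < timedelta(seconds=grace_seconds)
    return not in_grace_period


started = datetime(2024, 1, 1, tzinfo=timezone.utc)
# A failure 10 seconds after start is ignored with a 30 second grace period.
print(counts_as_unhealthy(started, started + timedelta(seconds=10), False, 30))
# The same failure 45 seconds after start counts.
print(counts_as_unhealthy(started, started + timedelta(seconds=45), False, 30))
```

If the application needs longer than 30 seconds to warm up, raising the grace period is usually simpler than loosening the health check itself.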

Networking: awsvpc Mode

The awsvpc networking mode is mandatory for Fargate and strongly recommended for EC2. Each task gets its own Elastic Network Interface (ENI) with its own private IP address in your VPC.

VPC 10.0.0.0/16
├── Public Subnet 10.0.1.0/24
│   └── Application Load Balancer
└── Private Subnet 10.0.2.0/24
    ├── ECS Task (10.0.2.45): ENI with security group sg-api-tasks
    ├── ECS Task (10.0.2.67): ENI with security group sg-api-tasks
    └── ECS Task (10.0.2.89): ENI with security group sg-api-tasks

With awsvpc, you can apply a security group directly to each task. A database security group can then allow inbound traffic from the task security group only, rather than from a whole subnet or host.
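Because every awsvpc task takes one IP address from its subnet, subnet size caps how many tasks can run there. The sketch below uses Python's standard ipaddress module and the fact that AWS reserves five addresses in every subnet; the function names are hypothetical, and the result is only an upper bound because load balancers, databases, and other ENIs draw from the same pool.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_PER_SUBNET = 5


def usable_task_ips(cidr: str) -> int:
    """Upper bound on ENIs (and so awsvpc tasks) the subnet can hold."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def task_in_subnet(task_ip: str, cidr: str) -> bool:
    """Check whether a task's private IP belongs to the given subnet."""
    return ipaddress.ip_address(task_ip) in ipaddress.ip_network(cidr)


print(usable_task_ips("10.0.2.0/24"))               # 251
print(task_in_subnet("10.0.2.45", "10.0.2.0/24"))   # True
print(task_in_subnet("10.0.1.10", "10.0.2.0/24"))   # False
```

A /24 private subnet therefore tops out at 251 task ENIs, which matters when rolling deployments temporarily run more than the desired count.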

Deployment Strategies

ECS services support multiple deployment types:

Rolling update (default): ECS replaces old tasks with new ones gradually. Configure minimumHealthyPercent and maximumPercent to control how aggressively it replaces.

aws ecs update-service \
  --cluster production \
  --service web-api \
  --task-definition web-api:4 \
  --deployment-configuration "minimumHealthyPercent=50,maximumPercent=200"

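With a desired count of 3, these settings require ECS to keep at least 2 tasks running (50% of 3, rounded up) and allow it to run up to 6 (200% of 3, rounded down) during the deployment. Because ECS can start new tasks before stopping old ones, the service keeps serving traffic throughout, although healthy capacity may dip below the desired count. Set minimumHealthyPercent to 100 if reduced capacity is not acceptable.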
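The two percentages translate into task-count limits. A small sketch of that arithmetic, assuming ECS rounds the minimum up and the maximum down as its documentation describes; rolling_bounds is a hypothetical helper name.

```python
def rolling_bounds(desired_count: int,
                   minimum_healthy_percent: int,
                   maximum_percent: int) -> tuple[int, int]:
    """Return (min running tasks, max running tasks) during a rolling update."""
    # Integer arithmetic avoids float rounding: ceil for the lower bound,
    # floor for the upper bound.
    lower = -(-desired_count * minimum_healthy_percent // 100)
    upper = desired_count * maximum_percent // 100
    return lower, upper


print(rolling_bounds(3, 50, 200))    # (2, 6)
print(rolling_bounds(4, 100, 150))   # (4, 6)
```

The second call shows a more conservative setting: capacity never drops below the desired 4 tasks, and ECS may add at most 2 extra tasks at a time.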

Blue/Green with CodeDeploy: CodeDeploy launches a new “green” set of tasks alongside the existing “blue” tasks, shifts ALB traffic to the green target group, and waits for validation before terminating the blue tasks. It supports automatic rollback if CloudWatch alarms fire.
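The traffic shift does not have to happen in one step. The sketch below models a linear shift as a plain schedule of (minute, percent of traffic on green) pairs. It is an illustration of the idea only: linear_shift is a hypothetical function, it does not call CodeDeploy, and the exact timing of real deployment configurations may differ.

```python
def linear_shift(step_percent: int, interval_minutes: int) -> list[tuple[int, int]]:
    """Return (minute, green_percent) pairs until green serves all traffic."""
    if step_percent <= 0:
        raise ValueError("step_percent must be positive")
    schedule = []
    green = 0
    minute = 0
    while green < 100:
        green = min(green + step_percent, 100)   # never overshoot 100%
        schedule.append((minute, green))
        minute += interval_minutes
    return schedule


for minute, green in linear_shift(30, 5):
    print(f"t+{minute:>2} min: green={green}% blue={100 - green}%")
```

A rollback in this model is simply resetting green to 0%, which is why the blue tasks are kept until validation has passed.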

Secrets Management

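Never put secrets in the environment block of a task definition as plain text, where anyone who can read the task definition can see them. Reference them from Secrets Manager or SSM Parameter Store instead: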

"secrets": [
  {
    "name": "DATABASE_URL",
    "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/database-url-abc123"
  },
  {
    "name": "API_KEY",
    "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/api-key"
  }
]

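ECS injects these as environment variables when the task starts. The task execution role needs secretsmanager:GetSecretValue for Secrets Manager secrets and ssm:GetParameters for Parameter Store parameters, plus kms:Decrypt if the values are encrypted with a customer-managed KMS key.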
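Which permission the execution role needs depends on the service named in each valueFrom ARN. The sketch below derives the set of IAM actions from a container definition. required_action and execution_role_actions are hypothetical helpers, and the code assumes every valueFrom is a full ARN; bare Parameter Store names are not handled.

```python
ACTIONS = {
    "secretsmanager": "secretsmanager:GetSecretValue",
    "ssm": "ssm:GetParameters",
}


def required_action(value_from: str) -> str:
    """Map a secret's ARN to the IAM action needed to read it."""
    parts = value_from.split(":")
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a full ARN: {value_from!r}")
    service = parts[2]            # arn:partition:service:region:account:resource
    if service not in ACTIONS:
        raise ValueError(f"unsupported service in ARN: {service!r}")
    return ACTIONS[service]


def execution_role_actions(container_def: dict) -> set[str]:
    """Collect the IAM actions needed for every entry in the secrets block."""
    return {required_action(s["valueFrom"]) for s in container_def.get("secrets", [])}


container = {
    "secrets": [
        {"name": "DATABASE_URL",
         "valueFrom": "arn:aws:secretsmanager:us-east-1:123456789012:secret:prod/database-url-abc123"},
        {"name": "API_KEY",
         "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/prod/api-key"},
    ]
}
print(sorted(execution_role_actions(container)))
```

Inside the container, the application reads the injected values like any other environment variable, for example os.environ["DATABASE_URL"] in Python.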

Real-World Scenario: Multi-Tier E-Commerce Backend

A three-tier backend on ECS:

┌───────────────────────────────────────────────────────────┐
│  ALB (internet-facing)                                    │
│    /api/*    → api-service (3 Fargate tasks)              │
│    /admin/*  → admin-service (2 Fargate tasks)            │
│                                                           │
│  api-service Task Definition:                             │
│    - Container: api (Node.js 20, 512 CPU, 1024 MB RAM)   │
│    - Pulls image from ECR                                 │
│    - Reads DB_PASSWORD from Secrets Manager               │
│    - Logs to CloudWatch Logs /ecs/api                    │
│                                                           │
│  RDS PostgreSQL (private subnet)                          │
│    - sg-db allows port 5432 from sg-api-tasks             │
└───────────────────────────────────────────────────────────┘

Deployment pipeline: Developer pushes to Git → CodePipeline triggers → CodeBuild builds and pushes image to ECR → CodeDeploy performs blue/green deployment to ECS.
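The ALB's path-based routing in this scenario can be approximated with shell-style pattern matching. This is a simplified model using Python's fnmatch module, which accepts a slightly different pattern syntax than ALB listener rules; the RULES list and route function are hypothetical and exist only to show which service a request path would reach.

```python
from fnmatch import fnmatchcase

# (path pattern, target service), evaluated in order like listener rule priority.
RULES = [
    ("/api/*", "api-service"),
    ("/admin/*", "admin-service"),
]


def route(path: str, rules=RULES, default=None):
    """Return the first service whose pattern matches the request path."""
    for pattern, target in rules:
        if fnmatchcase(path, pattern):   # case-sensitive match
            return target
    return default


print(route("/api/orders/42"))   # api-service
print(route("/admin/users"))     # admin-service
print(route("/health"))          # None
```

Paths that match no rule fall through to the default, which corresponds to the listener's default action on a real load balancer.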

ECS vs EKS: When to Use Each

Consideration        ECS                        EKS
Kubernetes required  No                         Yes
Learning curve       Lower                      Higher
AWS integration      Native                     Good (with add-ons)
Custom schedulers    No                         Yes
Service mesh         Service Connect, App Mesh  Istio, Linkerd
Portability          AWS only                   Portable to other clouds
Team profile         New to containers          K8s-experienced

Choose ECS when your team is new to containers, your workload is on AWS only, and you want simpler operations. Choose EKS when you need Kubernetes-specific features, have existing K8s expertise, or need portability.

Common Interview Questions

Q: What is the difference between a task and a service in ECS?
A: A task is a single running instance of a task definition. A service is a long-running controller that ensures a desired number of tasks are running and handles failure recovery, rolling deployments, and load balancer integration.

Q: What is the difference between the execution role and the task role?
A: The execution role is used by ECS itself to pull container images from ECR and write logs to CloudWatch. The task role is used by the containers to call AWS APIs from within the application code.

Q: Can ECS tasks communicate with each other without going through a load balancer?
A: Yes. With awsvpc networking, tasks have private IPs and can communicate directly if security groups allow it. AWS Cloud Map can provide service discovery: a task registers its IP and port, and other tasks look it up by name.

Q: What happens if an ECS task fails?
A: The ECS service scheduler detects the failure (via container exit code or ELB health check failure), deregisters the task from the load balancer, and starts a replacement task. The desired count is maintained automatically.