End-To-End Documentation

This document provides a consolidated, end-to-end view of the architecture design, infrastructure experiments, and Kubernetes-based deployment for the Helium application. It is structured as a practical engineering reference and runnable runbook.


1. Architecture Review & Design

1.1 Architecture Understanding

  • Reviewed the provided AWS architecture diagram in detail

  • Identified major components, service boundaries, and dependencies

  • Understood data flow between frontend, backend, worker, and data layers

  • Analyzed interactions between AWS-managed services and external integrations

1.2 Architecture Diagram Creation

  • Created an updated AWS architecture diagram

  • Ensured clear separation of:

  • Frontend layer

  • Backend API services

  • Asynchronous worker services

  • Data and caching components

  • Represented VPC boundaries, load balancers, and external service integrations

1.3 AWS Services Studied

  • Amazon VPC and subnet design

  • ECS Fargate for container-based workloads

  • Application Load Balancer (ALB)

  • NAT Gateway for outbound connectivity

  • IAM roles and permissions

  • Amazon SQS for asynchronous processing

1.4 Architectural Enhancements Incorporated

AWS Global Accelerator

  • Considered for improving global traffic routing and reducing latency

Asynchronous Processing with SQS

  • SQS positioned between Backend API and Worker services

  • Designed to decouple synchronous API requests from background processing

1.5 Data Flow Design

  • Frontend → Backend API

  • Backend API → SQS

  • Worker services → Cache / external services


2. ECS, Fargate & Core Infrastructure Setup (Failed During Initial Configuration)

2.1 Fargate & ElastiCache Redis Integration

  • Attempted integration of backend and worker Fargate containers with ElastiCache Redis

  • Reviewed VPC and subnet requirements for Redis connectivity

  • Configured security group rules for Redis access

  • Updated ECS task IAM roles for backend and worker containers

  • Resolved S3 permission issues: S3 HeadObject requests are authorized by s3:GetObject, since s3:HeadObject is not a valid IAM action

  • Reviewed ECS task role, execution role, and IAM policy behavior

2.2 Debugging & IAM Fixes

  • Investigated issues related to:

  • Host port visibility in task definitions

  • Essential container configuration

  • Missing S3 permissions

  • Corrected invalid IAM actions causing Invalid Action: s3:HeadObject errors

2.3 ElastiCache (Redis / Valkey) Setup

  • Created an ElastiCache Redis (Valkey) cluster inside the VPC

  • Restricted port 6379 access to the Fargate task security group

2.4 ECS Services & Networking Fixes

  • Created backend and worker ECS Fargate services

  • Fixed subnet configuration for frontend, backend, and worker services

  • Verified ENI attachment and internal VPC routing

  • Verified ALB → backend service connectivity

  • Investigated ECS service rollback failures using:

  • ECS service events

  • Task definitions

  • IAM permissions

  • Verified IAM access for ECS, ECR, CloudWatch Logs, and VPC operations

2.5 Deployment & Resource Configuration

  • Continued ECS Fargate deployment troubleshooting

  • Reviewed CPU and memory configuration at task and container levels

  • Validated ALB target group behavior and frontend–backend traffic flow


3. Kubernetes (kind) Based Deployment & Execution

3.1 Docker Image Build & Push (Docker Hub)

docker login

docker build -t <dockerhub-username>/helium-frontend:latest ./frontend
docker build -t <dockerhub-username>/helium-backend:latest ./backend
# helium-app.yaml also references <dockerhub-username>/helium-worker:latest;
# build and push it the same way (the ./worker build context is an assumption)
docker build -t <dockerhub-username>/helium-worker:latest ./worker


docker push <dockerhub-username>/helium-frontend:latest
docker push <dockerhub-username>/helium-backend:latest
docker push <dockerhub-username>/helium-worker:latest


3.2 Kubernetes Tooling Installation


#!/bin/bash

set -e  # Exit immediately if a command exits with a non-zero status

echo "Updating package index..."

sudo apt-get update

echo "Installing prerequisites..."

sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common

echo "Adding Docker's official GPG key..."

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg

echo "Setting up Docker stable repository..."

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

echo "Installing Docker..."

sudo apt-get update

sudo apt-get install -y docker-ce docker-ce-cli containerd.io

echo "Starting and enabling Docker..."

sudo systemctl start docker

sudo systemctl enable docker

echo "Adding current user to docker group..."

sudo usermod -aG docker $USER

echo "Please log out and log back in so that group changes take effect."

# ------------------------

echo "Installing kind..."

[ $(uname -m) = x86_64 ] && curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64

chmod +x ./kind

sudo mv ./kind /usr/local/bin/kind

echo "Installing kubectl..."

VERSION="v1.30.0"

URL="https://dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl"

INSTALL_DIR="/usr/local/bin"

curl -LO "$URL"

chmod +x kubectl

sudo mv kubectl $INSTALL_DIR/

kubectl version --client

echo "kind & kubectl installation complete!"

Save the script as install.sh, then run:

chmod +x install.sh
./install.sh
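The kind download step in the script above only handles x86_64 hosts (and, because of `set -e`, the script aborts on any other architecture). A small sketch of an architecture-aware variant — the helper name `kind_asset` is illustrative, not part of the original script:

```shell
# Map the output of `uname -m` to the matching kind release asset name.
# Returns non-zero for architectures kind does not publish binaries for.
kind_asset() {
  case "$1" in
    x86_64)  echo "kind-linux-amd64" ;;
    aarch64) echo "kind-linux-arm64" ;;
    *)       echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}
```

Usage sketch: `curl -Lo ./kind "https://kind.sigs.k8s.io/dl/v0.27.0/$(kind_asset "$(uname -m)")"`.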

3.3 Kubernetes Cluster Creation

kind create cluster --config kind-cluster.yaml
kubectl get nodes
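The kind-cluster.yaml referenced above is not included in this document. A minimal sketch of what it likely contains — the node layout is an assumption, but the extraPortMappings are needed so that the NodePorts (30000/30001) used later by the Nginx reverse proxy are reachable at 127.0.0.1 on the host:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      # Expose the frontend and backend NodePort services on the host,
      # so Nginx can proxy to 127.0.0.1:30000 / 127.0.0.1:30001
      - containerPort: 30000
        hostPort: 30000
      - containerPort: 30001
        hostPort: 30001
```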

3.3.1 Namespace & Secrets Setup

kubectl create namespace helium
kubectl apply -n helium -f helium-backend-secret.yaml
kubectl apply -n helium -f helium-worker-secret.yaml
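The secret manifests themselves are not reproduced in this document. A minimal sketch of the expected shape — the key names and values below are placeholders, not the application's actual variables:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: helium-backend-secret
  namespace: helium
type: Opaque
stringData:
  # Placeholder keys — replace with the backend's real environment variables
  REDIS_HOST: "redis.example.internal"
  DATABASE_URL: "postgres://user:password@host:5432/helium"
```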

3.3.2 Application Deployment

kubectl apply -n helium -f helium-app.yaml
kubectl get pods -n helium
kubectl get svc -n helium

helium-app.yaml


apiVersion: v1
kind: Namespace
metadata:
  name: helium
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-frontend
  namespace: helium
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helium-frontend
  template:
    metadata:
      labels:
        app: helium-frontend
    spec:
      containers:
        - name: frontend
          image: <dockerhub-username>/helium-frontend:latest
          ports:
            - containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-backend
  namespace: helium
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helium-backend
  template:
    metadata:
      labels:
        app: helium-backend
    spec:
      containers:
        - name: backend
          image: <dockerhub-username>/helium-backend:latest
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: helium-backend-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-worker
  namespace: helium
spec:
  replicas: 1
  selector:
    matchLabels:
      app: helium-worker
  template:
    metadata:
      labels:
        app: helium-worker
    spec:
      containers:
        - name: worker
          image: <dockerhub-username>/helium-worker:latest
          envFrom:
            - secretRef:
                name: helium-worker-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: helium-frontend
  namespace: helium
spec:
  type: NodePort
  selector:
    app: helium-frontend
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30000
---
apiVersion: v1
kind: Service
metadata:
  name: helium-backend
  namespace: helium
spec:
  type: NodePort
  selector:
    app: helium-backend
  ports:
    - port: 8000
      targetPort: 8000
      nodePort: 30001
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helium-backend-hpa
  namespace: helium
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helium-backend
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helium-worker-hpa
  namespace: helium
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helium-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

3.4 Nginx Reverse Proxy Setup & Configuration

sudo apt update
sudo apt install -y nginx
sudo cp helium.conf /etc/nginx/sites-available/helium
sudo ln -s /etc/nginx/sites-available/helium /etc/nginx/sites-enabled/helium
sudo nginx -t
sudo systemctl restart nginx

helium.conf

server {
    listen 80;
    server_name _;

    # FRONTEND
    location / {
        proxy_pass http://127.0.0.1:30000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # BACKEND API
    location /api/ {
        proxy_pass http://127.0.0.1:30001;
        proxy_http_version 1.1;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}

Traffic Flow

  • Browser → EC2 Public IP (port 80)

  • / routed to frontend NodePort 30000

  • /api/ routed to backend NodePort 30001


3.5 Horizontal Pod Autoscaling

  • HPA definitions are included directly in helium-app.yaml

  • Autoscaling is applied automatically during application deployment


3.6 Metrics Server Installation

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl get pods -n kube-system | grep metrics-server
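On kind, the stock metrics-server Deployment often stays unready because kind's kubelets serve self-signed certificates. A commonly used local-only workaround is to append the `--kubelet-insecure-tls` flag to its container args, e.g. with `kubectl patch deployment metrics-server -n kube-system --type=json -p "$(cat patch.json)"`, where patch.json is:

```json
[
  { "op": "add",
    "path": "/spec/template/spec/containers/0/args/-",
    "value": "--kubelet-insecure-tls" }
]
```

Note that this disables kubelet TLS verification and is acceptable only for local test clusters.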

3.7 Scaling Validation

watch kubectl get pods -n helium

  • Traffic was generated against the application UI to validate backend and worker pod autoscaling behavior
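One way to generate that validation traffic, assuming the Nginx proxy from section 3.4 is serving on localhost (the request count and endpoint are illustrative; /api/health is the health check path used elsewhere in this document):

```shell
# Fire 500 concurrent-ish requests at the backend through the proxy
# to push CPU utilization above the 70% HPA target.
for i in $(seq 1 500); do
  curl -s -o /dev/null "http://localhost/api/health" &
done
wait
```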


3.8 How to Run & Verify (Quick Flow)

# Build & push images
# Install tools
# Create cluster
# Deploy application
# Configure Nginx

4. AWS Infrastructure Foundation – Singapore (Detailed, Step-by-Step)

The following section is included verbatim from the original implementation guide. No steps or technical details have been removed.


Phase 1: Infrastructure Foundation (Singapore)

Overview

This phase sets up the base network infrastructure in Singapore (ap-southeast-1) using the AWS Management Console.

  • Region: ap-southeast-1 (Singapore)

  • Estimated Time: 45–60 minutes

Prerequisites


Step 1: Virtual Private Cloud (VPC) Setup

1.1 Create VPC

  1. Navigate to VPC service.

  2. Click Create VPC.

  3. VPC settings:

    • Resources to create: VPC only

    • Name tag: helium-singapore-vpc

    • IPv4 CIDR block: 10.0.0.0/16

    • Tenancy: Default

  4. Click Create VPC.

1.2 Create Subnets

  1. Click Subnets → Create subnet.

  2. Select VPC ID: helium-singapore-vpc.

Public Subnets

  • helium-public-subnet-1 | AZ: ap-southeast-1a | CIDR: 10.0.1.0/24

  • helium-public-subnet-2 | AZ: ap-southeast-1b | CIDR: 10.0.2.0/24

Private Subnets

  • helium-private-subnet-1 | AZ: ap-southeast-1a | CIDR: 10.0.10.0/24

  • helium-private-subnet-2 | AZ: ap-southeast-1b | CIDR: 10.0.11.0/24

  3. Click Create subnet.

1.3 Internet Gateway (IGW)

  1. Navigate to Internet gateways → Create internet gateway.

  2. Name tag: helium-igw.

  3. Create and attach to VPC helium-singapore-vpc.

1.4 NAT Gateway

  1. Navigate to NAT gateways → Create NAT gateway.

  2. Name: helium-nat-gw.

  3. Subnet: helium-public-subnet-1.

  4. Connectivity type: Public.

  5. Allocate Elastic IP.

  6. Click Create NAT gateway.

1.5 Route Tables

Public Route Table

  • Name: helium-public-rt

  • Route: 0.0.0.0/0 → Internet Gateway

  • Associate public subnets

Private Route Table

  • Name: helium-private-rt

  • Route: 0.0.0.0/0 → NAT Gateway

  • Associate private subnets


Step 2: Security Groups

2.1 Load Balancer Security Group

  • Name: helium-alb-sg

  • Inbound rules:

  • HTTP (80) from 0.0.0.0/0

  • HTTPS (443) from 0.0.0.0/0

2.2 ECS Task Security Group

  • Name: helium-ecs-sg

  • Inbound rules:

  • TCP 8000 from helium-alb-sg

  • TCP 3000 from helium-alb-sg


Step 3: ECR Repositories

  1. Navigate to Elastic Container Registry (ECR).

  2. Create private repositories:

    • helium-backend


Step 4: SSL Certificate (ACM)

  1. Navigate to Certificate Manager.

  2. Request a public certificate.

  3. Domain: he2.site

  4. Additional name: *.he2.site

  5. Validation method: DNS.

  6. Create DNS records in Route 53.


Step 5: Verification Checklist

  • VPC helium-singapore-vpc exists

  • Public subnets route to IGW

  • Private subnets route to NAT Gateway

  • Security group rules validated

  • ECR repositories created

  • ACM certificate issued


Regional Reuse (Virginia & Mumbai)

The exact same steps above are repeated for:

  • Virginia (us-east-1)

  • Mumbai (ap-south-1)

Only the following change:

  • AWS Region

  • Availability Zones

  • Resource names (VPC, subnets, ALB, ECS cluster)

No architectural or procedural changes are required.


Phase 2: Backend Deployment (Singapore) – AWS Console Guide

Overview

This phase deploys the backend services to AWS ECS Fargate in Singapore (ap-southeast-1) using the AWS Management Console.

  • Region: ap-southeast-1

  • Estimated Time: 60–90 minutes


Prerequisites

  • Phase 1 completed

  • Docker installed locally (required for building images)

  • AWS CLI installed locally (required for authentication)


Step 1: Push Docker Images (Local Terminal)

Docker images must be built locally. This step cannot be performed from the AWS Console.

  1. Open your local terminal (VS Code / Terminal).

  2. Login to Amazon ECR (Singapore region):

aws ecr get-login-password --region ap-southeast-1 | docker login --username AWS --password-stdin <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com
  3. Build the backend image (ARM64 architecture):

docker build --platform linux/arm64 -t helium-backend:latest ./backend
  4. Tag the image:

docker tag helium-backend:latest <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com/helium-backend:latest
  5. Push the image to ECR:

docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com/helium-backend:latest
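The ECR registry hostname in the commands above follows a fixed pattern; a small helper (the function name `ecr_registry` is illustrative) makes the tag and push targets less error-prone when repeating these steps per region:

```shell
# Compose the ECR registry hostname from an AWS account ID and region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com' "$1" "$2"
}
```

Usage sketch: `REG="$(ecr_registry 123456789012 ap-southeast-1)"` then `docker tag helium-backend:latest "$REG/helium-backend:latest" && docker push "$REG/helium-backend:latest"`.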

Step 2: Secrets Management

  1. Navigate to AWS Secrets Manager (Singapore region).

  2. Click Store a new secret.

  3. Secret type: Other type of secret.

  4. Add the required key/value pairs.

  5. Name the secret: helium/backend/production.

  6. Store the secret and copy the Secret ARN.


Step 3: IAM Roles

3.1 Create ECS Task Execution Role

  • Trusted entity: Elastic Container Service Task

  • Attach policies:

  • AmazonECSTaskExecutionRolePolicy

  • AmazonEC2ContainerRegistryReadOnly

  • Role name: helium-ecs-execution-role-sg

3.2 Add Secrets Manager Permission

Attach an inline policy allowing secretsmanager:GetSecretValue for the backend secret ARN.
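A sketch of that inline policy — the Resource ARN is a placeholder for the Secret ARN copied in Step 2 (Secrets Manager appends a random suffix to secret ARNs, hence the trailing `-*`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:ap-southeast-1:<YOUR_AWS_ACCOUNT_ID>:secret:helium/backend/production-*"
    }
  ]
}
```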


Step 4: ECS Cluster Creation

  • Cluster name: helium-singapore-cluster

  • Infrastructure type: Fargate (serverless)


Step 5: Task Definition

  • Task definition family: helium-backend-sg

  • Launch type: Fargate

  • OS / Architecture: Linux / ARM64

  • Task size:

  • CPU: 0.5 vCPU

  • Memory: 1 GB

  • Container configuration:

  • Name: backend

  • Image: Backend ECR image

  • Port: 8000

  • Secrets injected from AWS Secrets Manager

  • CloudWatch logging enabled


Step 6: Application Load Balancer (ALB)

  • Load balancer type: Application Load Balancer

  • Scheme: Internet-facing

  • Name: helium-sg-alb

  • Listener: HTTPS (443) with ACM certificate

  • Target group:

  • Type: IP

  • Port: 8000

  • Health check path: /api/health


Step 7: Deploy ECS Service

  • Service name: helium-backend-service

  • Desired tasks: 2

  • VPC: helium-singapore-vpc

  • Subnets: Private subnets only

  • Public IP: Disabled

  • Load balancing configuration:

  • Load balancer: helium-sg-alb

  • Listener: HTTPS (443)

  • Target group: helium-backend-tg-sg

  • Container: backend:8000


Verification

  • ECS service reaches Steady state

  • Target group shows healthy targets

  • https://<ALB-DNS-NAME>/api/health returns HTTP 200

  • Optional Route 53 record: api-sg.he2.site


Phase 3: Data & Cache Layer (Singapore) – AWS Console Guide

Overview

This phase creates a Redis (ElastiCache) cluster in Singapore for low-latency caching and sets up an S3 bucket for object storage.

  • Region: ap-southeast-1

  • Estimated Time: 30–45 minutes


Step 1: Redis (ElastiCache)

Navigate to ElastiCache in the AWS Management Console.

1.1 Create Subnet Group

  1. Click Subnet groups → Create subnet group.

  2. Name: helium-redis-subnet-group-sg.

  3. Description: Singapore Redis Subnets.

  4. VPC: helium-singapore-vpc.

  5. Add subnets: Select both private subnets.

  6. Click Create.


1.2 Create Security Group (EC2)

  1. Navigate to EC2 → Security Groups → Create security group.

  2. Name: helium-redis-sg.

  3. VPC: helium-singapore-vpc.

  4. Inbound rule:

    • Type: Custom TCP

    • Port: 6379

    • Source: helium-ecs-sg

  5. Click Create.


1.3 Create Redis Cluster

  1. Navigate to ElastiCache → Redis caches → Create Redis cache.

  2. Deployment option: Design your own cache.

  3. Creation method: Cluster.

  4. Name: helium-redis-sg.

  5. Cluster mode: Disabled (can optionally be enabled later for horizontal scaling).

  6. Node type: cache.t3.micro.

  7. Number of replicas: 1.

  8. Subnet group: helium-redis-subnet-group-sg.

  9. Security group: helium-redis-sg.

  10. Enable encryption at rest and in transit.

  11. Click Create.


1.4 Update Secrets Manager

  1. Wait until Redis status is Available.

  2. Copy the primary endpoint.

  3. Navigate to Secrets Manager → helium/backend/production.

  4. Edit and update REDIS_HOST.

  5. Save changes.

  6. Force new deployment of backend ECS service.


Step 2: S3 Storage (Singapore)

  1. Navigate to Amazon S3.

  2. Click Create bucket.

  3. Bucket name: helium-files-sg.

  4. Region: ap-southeast-1.

  5. Keep Block all public access enabled.

  6. Enable bucket versioning.

  7. Enable server-side encryption (SSE-S3).

  8. Click Create bucket.


2.1 Configure CORS

  1. Open the bucket → Permissions tab.

  2. Edit CORS configuration and paste the JSON policy.
    [
      {
        "AllowedHeaders": ["*"],
        "AllowedMethods": ["GET", "PUT", "POST", "DELETE", "HEAD"],
        "AllowedOrigins": ["https://he2.site", "https://www.he2.site"],
        "ExposeHeaders": ["ETag"],
        "MaxAgeSeconds": 3000
      }
    ]

  3. Save changes.


Phase 4: Agent Execution (Singapore) – AWS Console Guide

Overview

This phase configures the background worker service in Singapore to process long-running agent and asynchronous tasks using ECS Fargate.

  • Region: ap-southeast-1

  • Estimated Time: 20–30 minutes


Prerequisites

  • Backend ECS service successfully deployed

  • Redis (ElastiCache) cluster in Available state


Step 1: Create Worker ECS Service

The worker service reuses the same backend Docker image but runs with a different container command.

  1. Navigate to ECS → Clusters → helium-singapore-cluster.

  2. Open the Services tab and click Create.

  3. Compute options: FARGATE.

  4. Task definition family: helium-backend-sg.

  5. Service name: helium-worker-service.

  6. Desired tasks: 2.

  7. Networking configuration:

    • VPC: helium-singapore-vpc

    • Subnets: Private subnets

    • Security group: helium-ecs-sg

    • Public IP: Disabled

  8. Container overrides:

    • Container: backend

    • Command: python,run_worker.py (comma-separated, as entered in the ECS console)

  9. Click Create.


Step 2: Service Auto Scaling (Optional)

  1. Open helium-worker-service.

  2. Update service and enable auto scaling.

  3. Minimum tasks: 1

  4. Maximum tasks: 10

  5. Scaling policy: Target tracking

  6. Metric: ECSServiceAverageCPUUtilization

  7. Target value: 70


Regional Reuse (Virginia & Mumbai)

The same Agent Execution (Phase 4) steps can be repeated for:

  • Virginia (us-east-1)

  • Mumbai (ap-south-1)

Only the following change:

  • AWS Region

  • ECS cluster name

  • Subnets and security groups (region-specific)

No architectural or procedural changes are required.


5. AWS Global Accelerator & Route 53 Integration

1. Purpose of This Document

This section describes the end-to-end setup of AWS Global Accelerator (GA) integrated with multi-region Application Load Balancers (ALBs) and Amazon Route 53. It covers architecture, configuration steps, routing behavior, traffic flow, and best practices from a Cloud Engineer perspective.


2. What is AWS Global Accelerator?

AWS Global Accelerator is a global networking service that improves application availability and performance by directing user traffic to the nearest healthy AWS Region using Anycast static IPs.

Key Characteristics

  • Global service (not region-specific)

  • Provides two static Anycast IP addresses

  • Routes traffic at AWS Edge locations

  • Supports multi-region endpoints (ALB, NLB, EC2, Elastic IP)

  • Offers near-instant regional failover

Important: Global Accelerator is managed from US West (Oregon). This is only the control plane location and does not mean that application traffic flows through Oregon.


3. Architecture Overview

User → Route 53 (DNS) → Global Accelerator (Anycast IP) → Nearest AWS Edge Location → Closest Healthy Region → Regional ALB → ECS Services


4. Why Use a Single Global Accelerator for Multiple Regions

Recommended Design

  • One Global Accelerator per application

  • Multiple endpoint groups (one per AWS Region)

  • Each endpoint group contains the regional ALB

Benefits

  • Automatic latency-based routing

  • Fast and seamless regional failover

  • Simple and clean DNS configuration

  • Lower operational overhead and cost

What NOT to Do

  • Do not create multiple Global Accelerators for the same application

  • Do not use Route 53 latency routing with Global Accelerator


5. Global Accelerator Configuration Steps

Step 1: Create the Global Accelerator

  • Create a new Global Accelerator

  • Listener protocol: TCP

  • Listener ports: 80 / 443 (based on ALB configuration)

  • Save the GA DNS name and static IP addresses

Step 2: Create Endpoint Groups (One Per Region)

Example regions:

  • ap-south-1 (Mumbai)

  • ap-southeast-1 (Singapore)

  • us-east-1 (Virginia)

For each endpoint group:

  • Set traffic dial to 100%

  • Use default health check settings

Step 3: Add ALBs to Endpoint Groups

For each region:

  • Endpoint type: Application Load Balancer

  • Select the ALB from the same region

  • Keep default weight unless traffic tuning is required


6. Security Group Considerations

  • Global Accelerator itself does not have a security group

  • Traffic reaches the ALB from AWS Edge locations

ALB Security Group must allow inbound HTTP/HTTPS traffic from either:

  • 0.0.0.0/0 (simpler, less restrictive)

  • AWS-managed Global Accelerator prefix list (recommended)


7. Route 53 Configuration (DNS Setup)

Correct Record Setup

  • Record name: api

  • Record type: A (IPv4)

  • Alias: Yes

  • Route traffic to: Alias to Global Accelerator

  • Routing policy: Simple routing

  • Evaluate target health: Yes

  • Health check: Not required

Important Notes

  • Route 53 always shows US West (Oregon) for Global Accelerator

  • This is expected behavior

  • Do not select Latency, Geo, or Weighted routing


8. Why Simple Routing is Mandatory

Layer              | Responsibility
Route 53           | DNS resolution only
Global Accelerator | Latency routing & failover
ALB                | Application load balancing

9. Traffic Flow Explanation

  1. User resolves api.domain.com

  2. Route 53 returns the Global Accelerator IP

  3. User connects to nearest AWS Edge location

  4. Global Accelerator selects the closest healthy region

  5. Traffic forwards to the regional ALB

  6. ALB routes traffic to ECS service tasks


10. High Availability & Failover

  • Endpoint health is continuously monitored

  • If a region becomes unhealthy, traffic is instantly routed to the next healthy region

  • No DNS TTL or propagation delay


11. AWS Global Accelerator – Traffic Capacity

Global Accelerator

  • Designed to handle millions of requests per second

  • Built on the AWS global edge network using Anycast IPs

  • Scales automatically with traffic spikes

Application Load Balancer (per region – approximate soft limits)

  • ~100,000+ requests per second

  • ~3,000 new connections per second

  • ~100,000 active connections

Actual capacity depends on instance size, target type, and request patterns.


12. Conclusion

This architecture provides:

  • Global performance optimization

  • High availability and fast failover

  • Simple and reliable DNS management


6. AWS X-Ray Integration (ECS – Infrastructure Level)

1. Purpose

This section documents the AWS X-Ray integration for ECS-based microservices from an infrastructure and Cloud Engineer perspective. Application-level instrumentation is intentionally excluded and handled by development teams.


2. What is AWS X-Ray

AWS X-Ray is a distributed tracing service used to analyze, debug, and monitor applications by tracking requests as they traverse AWS services.

Key Capabilities

  • End-to-end request tracing across services

  • Visual service maps showing dependencies

  • Latency and error analysis

  • Identification of faults, throttling, and failures

  • Native integration with Amazon CloudWatch


3. When to Use AWS X-Ray

AWS X-Ray is especially useful when:

  • Applications use a microservices architecture

  • Requests span multiple AWS services or regions

  • Latency issues require root-cause analysis

  • Precise failure points must be identified


4. High-Level Architecture Flow

Client Request → Application Load Balancer (ALB) → ECS Service (Application Container) → X-Ray Daemon (Sidecar Container) → AWS X-Ray Service → Service Maps & Traces


5. Cloud Engineer Responsibilities

  • Provision and validate X-Ray infrastructure

  • Add X-Ray daemon as a sidecar container

  • Configure required IAM permissions

  • Ensure compatibility with AWS-managed services

  • Integrate observability with CloudWatch


6. ECS X-Ray Setup (Infrastructure Side)

6.1 Prerequisites

  • ECS services running behind an Application Load Balancer

  • Tasks have outbound internet or NAT access

  • X-Ray supported in the selected AWS region

No account-level enablement is required.


6.2 X-Ray and Application Load Balancer

  • No manual "Enable X-Ray" option exists on ALB

  • ALB automatically injects the X-Amzn-Trace-Id header

  • ALB segments appear only after backend instrumentation

No ALB configuration changes are required.


6.3 ECS Task Definition – X-Ray Daemon (Sidecar)

For each ECS service (backend, worker):

  • Add X-Ray daemon as a sidecar container

  • Container image:

public.ecr.aws/xray/aws-xray-daemon:latest
  • Expose port 2000/UDP

  • Use the same network mode as the application container

  • Runs within the same ECS task

Responsibilities of the X-Ray Daemon

  • Receives trace data from application containers

  • Forwards trace data to AWS X-Ray
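A hedged sketch of the relevant containers section of the backend task definition — container names and the daemon address match section 6.4 below; all other task-definition fields are omitted:

```json
"containerDefinitions": [
  {
    "name": "backend",
    "environment": [
      { "name": "AWS_XRAY_DAEMON_ADDRESS", "value": "helium-backend-xray:2000" }
    ]
  },
  {
    "name": "helium-backend-xray",
    "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
    "portMappings": [
      { "containerPort": 2000, "protocol": "udp" }
    ],
    "essential": false
  }
]
```

Marking the daemon as non-essential keeps a daemon crash from stopping the application container.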


6.4 Environment Variables Configuration

Environment variables must be added to both the application container and the X-Ray daemon container.

Backend

  • X-Ray daemon container name: helium-backend-xray

  • Key: AWS_XRAY_DAEMON_ADDRESS

  • Value: helium-backend-xray:2000

Worker

  • X-Ray daemon container name: helium-worker-xray

  • Key: AWS_XRAY_DAEMON_ADDRESS

  • Value: helium-worker-xray:2000

Notes

  • Port 2000/UDP must be exposed

  • Container names act as DNS hostnames inside ECS tasks


6.5 IAM Configuration (Task Role)

Attach the managed policy below to the ECS Task Role:

  • AWSXrayWriteOnlyAccess

This allows:

  • Sending trace segments and subsegments

  • Communication with the X-Ray service

The task execution role does not require X-Ray permissions.


6.6 Deploy Updated ECS Services

  • Register the updated task definition

  • Deploy changes to backend and worker services

  • Confirm X-Ray daemon container is in RUNNING state


6.7 Validate X-Ray Daemon Logs

Expected log messages:

  • Successful initialization

  • Region detection confirmation

  • Get instance id metadata failed warnings (expected in ECS/Fargate)


6.8 Expected State Before Application Instrumentation

  • No service map visible

  • No traces in X-Ray console

This is expected until application code is instrumented and traffic flows.


7. Developer Responsibilities (Reference)

  • Add AWS X-Ray SDKs (Java, Node.js, Python, etc.)

  • Enable tracing in application code

  • Use meaningful service and subsegment names


8. Understanding Service Maps

What Service Maps Show

  • Connected services

  • Request flow paths

  • Error and fault locations

  • Latency between components

How Maps Are Generated

  • Applications send trace data

  • X-Ray builds maps automatically

  • Maps update dynamically with traffic


9. Viewing Traces and Errors

  • AWS Console → X-Ray → Traces

  • AWS Console → X-Ray → Service Map

You can analyze:

  • 4xx / 5xx errors

  • Slow requests

  • Latency breakdown per service

  • Exact failure points


10. Common Issues and Observations

Daemon Running but No Traces

  • Application not instrumented

  • No incoming traffic

  • Wrong AWS region selected

IMDS Errors in Logs

  • Expected in ECS/Fargate

  • Safe to ignore

  • Do not impact trace collection


11. Integration with Amazon CloudWatch

  • X-Ray integrates natively with CloudWatch

  • Metrics can be used for alarms

  • Logs, metrics, and traces can be correlated

CloudWatch = metrics & logs
X-Ray = request-level visibility


12. Best Practices

  • Always run X-Ray daemon as a sidecar

  • Use clear and consistent service names

  • Combine X-Ray with CloudWatch alarms

  • Enable tracing early in lower environments


13. X-Ray Daemon Container Image

Recommended image:

public.ecr.aws/xray/aws-xray-daemon:latest
  • Official AWS-maintained image

  • Actively updated

  • No Docker Hub rate limits

  • Optimized for ECS, EKS, and Fargate


14. Conclusion

AWS X-Ray enables deep visibility into distributed systems. By preparing infrastructure in advance, Cloud Engineers allow development teams to activate tracing seamlessly when application changes are introduced.


7. Staging Deployment – ECS Fargate (Isolated Environment)

Overview

This section documents the staging environment deployment of the Helium application using AWS ECS Fargate. The staging setup mirrors production architecture while remaining fully isolated to safely validate changes before promotion.

The staging environment was created with separate clusters, services, task definitions, load balancer, SSL certificate, and Global Accelerator.


Staging Resources Created

ECS & Compute

  • ECS Cluster: helium-backend-staging-cluster

  • Backend ECS Service: helium-backend-staging-service

  • Worker ECS Service: helium-worker-staging-service

  • Backend Task Definition: helium-backend-staging

  • Worker Task Definition: helium-worker-staging

Networking & Security

  • Dedicated Application Load Balancer for staging

  • Separate Target Groups for backend staging service

Certificates & DNS

  • Separate ACM certificate created for staging domain

  • Certificate attached exclusively to the staging ALB

Global Traffic Management

  • Separate AWS Global Accelerator created for staging

  • Staging ALB registered as an endpoint in the staging Global Accelerator

  • Isolation maintained between production and staging traffic paths


Staging Architecture Characteristics

  • Complete isolation from production resources

  • Same ECS Fargate launch type

  • Independent deployment lifecycle

  • Safe environment for testing infrastructure, scaling, and application changes


Deployment Flow (Staging)

  1. Build and push Docker images (same images reused or staging-tagged as required)

  2. Register staging task definitions (helium-backend-staging, helium-worker-staging)

  3. Deploy backend ECS service in helium-backend-staging-cluster

  4. Deploy worker ECS service with background execution command

  5. Attach staging services to the staging ALB and target groups

  6. Validate health checks and service steady state

  7. Register staging ALB with the staging Global Accelerator

  8. Validate end-to-end traffic flow through Global Accelerator → ALB → ECS services
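Steps 4–6 of the flow (deploy the services and wait for steady state) can be sketched with boto3. This is a hedged sketch, not the pipeline itself: cluster and service names follow this document, the region is assumed to be Mumbai (ap-south-1), and `run()` requires real AWS credentials.

```python
def update_service_params(cluster, service, task_definition):
    """Parameters for ecs.update_service (boto3), forcing a new deployment."""
    return {
        "cluster": cluster,
        "service": service,
        "taskDefinition": task_definition,
        "forceNewDeployment": True,
    }

STAGING_ROLLOUT = [
    update_service_params("helium-backend-staging-cluster",
                          "helium-backend-staging-service",
                          "helium-backend-staging"),
    update_service_params("helium-backend-staging-cluster",
                          "helium-worker-staging-service",
                          "helium-worker-staging"),
]

def run(rollout=STAGING_ROLLOUT):
    """Deploy both staging services, then block until they reach steady state."""
    import boto3  # imported lazily so the module stays importable without AWS
    ecs = boto3.client("ecs", region_name="ap-south-1")
    for params in rollout:
        ecs.update_service(**params)
    ecs.get_waiter("services_stable").wait(
        cluster="helium-backend-staging-cluster",
        services=[p["service"] for p in rollout],
    )
```

The `services_stable` waiter mirrors the pipeline's `wait-for-service-stability: true`, so a failed health check surfaces as a timeout rather than a silent bad deploy.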


Purpose of Staging Environment

  • Validate ECS task definition changes

  • Test scaling behavior and resource limits

  • Verify Global Accelerator and ALB routing

  • Validate certificates and HTTPS termination

  • Perform safe functional and performance testing before production release


8. CI/CD – ECS Fargate Multi-Region Deployment

This section documents the CI/CD pipeline used to build, push, and deploy Helium backend and worker services across staging and production environments using GitHub Actions and Amazon ECS.

The pipeline supports:

  • Branch-based deployments

  • Multi-region Docker image builds

  • Safe staging deployments

  • Controlled production rollouts


CI/CD Workflow Definition

name: CICD

# =====================================================
# 🚀 TRIGGERS
# =====================================================
# - beta     → STAGING deploy
# - migrated-demo2 → PRODUCTION deploy
# =====================================================
on:
  push:
    branches:
      - beta
      - migrated-demo2
  workflow_dispatch:

# =====================================================
# 🌐 GLOBAL VARIABLES
# =====================================================
env:
  IMAGE_TAG: v${{ github.run_number }}
  ECR_REPO: helium-backend

# =====================================================
# 🏗️ BUILD & PUSH DOCKER IMAGES
# =====================================================
# - Builds ARM64 image
# - Pushes to ECR in ALL required regions
# - Runs for BOTH staging and production branches
# =====================================================
jobs:
  build:
    name: Build & Push Docker Images
    runs-on: ubuntu-latest

    strategy:
      matrix:
        region:
          - ap-south-1        # Mumbai
          - ap-southeast-1    # Singapore
          - us-east-1         # Virginia

    outputs:
      image_ap: ${{ steps.export.outputs.image_ap }}
      image_sg: ${{ steps.export.outputs.image_sg }}
      image_us: ${{ steps.export.outputs.image_us }}

    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Setup Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ matrix.region }}

      - name: Login to Amazon ECR
        id: login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build & Push Image
        run: |
          IMAGE_URI=${{ steps.login.outputs.registry }}/${{ env.ECR_REPO }}:${{ env.IMAGE_TAG }}

          docker buildx build \
            --platform linux/arm64 \
            -t $IMAGE_URI \
            --push \
            -f backend/Dockerfile backend

          echo "IMAGE_URI=$IMAGE_URI" >> $GITHUB_ENV

      - name: Export Image Output
        id: export
        run: |
          if [ "${{ matrix.region }}" = "ap-south-1" ]; then
            echo "image_ap=$IMAGE_URI" >> $GITHUB_OUTPUT
          elif [ "${{ matrix.region }}" = "ap-southeast-1" ]; then
            echo "image_sg=$IMAGE_URI" >> $GITHUB_OUTPUT
          else
            echo "image_us=$IMAGE_URI" >> $GITHUB_OUTPUT
          fi

# =====================================================
# 🔵 DEPLOY STAGING (Mumbai ONLY)
# =====================================================
  deploy-staging:
    name: Deploy STAGING (Mumbai)
    runs-on: ubuntu-latest
    needs: build

    if: github.ref_name == 'beta'

    strategy:
      matrix:
        include:
          - task_type: backend
            container: helium-backend-staging
            td: helium-backend-staging
            svc: helium-backend-staging-service
            cluster: helium-mumbai-staging-cluster

          - task_type: worker
            container: helium-worker-staging
            td: helium-worker-staging
            svc: helium-worker-staging-service
            cluster: helium-mumbai-staging-cluster

    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ap-south-1

      - name: Download Task Definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ matrix.td }} \
            --query taskDefinition \
          | jq 'del(.taskDefinitionArn,.revision,.status,.requiresAttributes,.compatibilities,.registeredAt,.registeredBy)' > task.json

      - name: Render Task Definition
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        id: render
        with:
          task-definition: task.json
          container-name: ${{ matrix.container }}
          image: ${{ needs.build.outputs.image_ap }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ matrix.svc }}
          cluster: ${{ matrix.cluster }}
          wait-for-service-stability: true

# =====================================================
# 🔴 DEPLOY PRODUCTION (MULTI-REGION)
# =====================================================
  deploy-production:
    name: Deploy PRODUCTION
    runs-on: ubuntu-latest
    needs: build

    if: github.ref_name == 'migrated-demo2'

    strategy:
      matrix:
        include:
          - region: ap-south-1
            image: ${{ needs.build.outputs.image_ap }}
            cluster: helium-mumbai-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service

          - region: ap-south-1
            image: ${{ needs.build.outputs.image_ap }}
            cluster: helium-mumbai-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service

          - region: ap-southeast-1
            image: ${{ needs.build.outputs.image_sg }}
            cluster: helium-singapore-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service

          - region: ap-southeast-1
            image: ${{ needs.build.outputs.image_sg }}
            cluster: helium-singapore-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service

          - region: us-east-1
            image: ${{ needs.build.outputs.image_us }}
            cluster: helium-virginia-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service

          - region: us-east-1
            image: ${{ needs.build.outputs.image_us }}
            cluster: helium-virginia-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service

    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ matrix.region }}

      - name: Download Task Definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ matrix.td }} \
            --query taskDefinition \
          | jq 'del(.taskDefinitionArn,.revision,.status,.requiresAttributes,.compatibilities,.registeredAt,.registeredBy)' > task.json

      - name: Render Task Definition
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        id: render
        with:
          task-definition: task.json
          container-name: ${{ matrix.container }}
          image: ${{ matrix.image }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ matrix.svc }}
          cluster: ${{ matrix.cluster }}
          wait-for-service-stability: true
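The `jq` filter in the Download Task Definition steps strips read-only fields that `describe-task-definition` returns but `register-task-definition` rejects. The same cleanup can be expressed in Python for local experimentation — the sample task definition below is illustrative:

```python
import json

# Read-only fields to strip before re-registering a task definition,
# mirroring the jq `del(...)` filter used in the workflow.
READ_ONLY_FIELDS = (
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy",
)

def strip_read_only(task_definition: dict) -> dict:
    """Return a copy of the task definition without read-only fields."""
    return {k: v for k, v in task_definition.items() if k not in READ_ONLY_FIELDS}

sample = {
    "family": "helium-backend-staging",
    "revision": 42,
    "status": "ACTIVE",
    "taskDefinitionArn": "arn:aws:ecs:ap-south-1:123456789012:task-definition/helium-backend-staging:42",
    "containerDefinitions": [],
}
print(json.dumps(strip_read_only(sample), indent=2))
```

Keeping this list in one place matters: if `describe-task-definition` ever returns a new read-only field, the deploy step fails until the filter is updated.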

9. Staging Deployment & CI/CD Pipeline (Summary)

This section consolidates the Staging Deployment & CI/CD Pipeline documentation into a single reference, aligned with the staging and CI/CD sections described above.

Scope Covered

  • Isolated staging environment using ECS Fargate

  • Separate staging ECS clusters, services, and task definitions

  • Dedicated staging Application Load Balancer and ACM certificate

  • Independent AWS Global Accelerator for staging traffic

  • GitHub Actions based CI/CD pipeline supporting:

    • Branch‑based deployments (beta → staging, migrated-demo2 → production)

    • Multi‑region Docker image builds

    • Safe, controlled rollouts to ECS services

Key Staging Resources

  • ECS Cluster: helium-backend-staging-cluster

  • Backend Service: helium-backend-staging-service

  • Worker Service: helium-worker-staging-service

  • Task Definitions:

    • helium-backend-staging

    • helium-worker-staging

  • Dedicated ALB, target groups, ACM certificate, and Global Accelerator

CI/CD Overview

  • ARM64 Docker images are built from each commit (one build per target region via the job matrix)

  • Images are pushed to ECR in Mumbai, Singapore, and Virginia

  • Staging deploys are automatically triggered from the beta branch

  • Production deploys are gated and triggered from migrated-demo2

  • ECS task definitions are rendered dynamically and deployed with service‑stability checks

This completes the Day‑1 architecture documentation by extending it through staging isolation and automated delivery pipelines, ensuring a full lifecycle view from design to deployment.


10. Monitoring & Alerting (AWS Managed Grafana & CloudWatch)

This section documents the monitoring and alerting setup implemented using Amazon CloudWatch and AWS Managed Grafana to ensure visibility into system health and proactive notifications.

Overview

Monitoring was configured for ECS Fargate services (backend and worker) across environments to track CPU and memory utilization, visualize metrics, and trigger alerts based on defined thresholds.

CloudWatch Metrics & Alarms

The following CloudWatch alarms were created for all ECS services:

  • High CPU Utilization Alarm

    • Metric: CPUUtilization

    • Threshold: Greater than 70%

    • Evaluation: Sustained breach over configured periods

    • Action: Trigger alarm notification

  • High Memory Utilization Alarm

    • Metric: MemoryUtilization

    • Threshold: Greater than 70%

    • Evaluation: Sustained breach over configured periods

    • Action: Trigger alarm notification

  • Normal Utilization (Recovery) Alarm

    • Metric: CPU and Memory Utilization

    • Threshold: Below 60%

    • Purpose: Indicate service has returned to a normal operating range

    • Action: Send notification email indicating system stability
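The high-CPU alarm above maps directly onto CloudWatch's PutMetricAlarm API. A hedged sketch of the parameters — the period, evaluation count, and SNS topic ARN are illustrative; the memory alarm is identical with `MetricName: "MemoryUtilization"`, and the recovery alarm flips the comparison to `LessThanThreshold` at 60:

```python
def high_cpu_alarm_params(cluster, service, sns_topic_arn):
    """Parameters for cloudwatch.put_metric_alarm (boto3): fire when the
    ECS service's average CPU stays above 70% for 5 consecutive minutes."""
    return {
        "AlarmName": f"{service}-cpu-high",
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,               # one datapoint per minute
        "EvaluationPeriods": 5,     # sustained breach, not a single spike
        "Threshold": 70.0,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }

params = high_cpu_alarm_params(
    "helium-backend-staging-cluster",
    "helium-backend-staging-service",
    "arn:aws:sns:ap-south-1:123456789012:helium-alerts",  # illustrative ARN
)
# To create the alarm: boto3.client("cloudwatch").put_metric_alarm(**params)
```

Requiring five consecutive breaching minutes filters out deployment-time CPU spikes that would otherwise page on every rollout.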

Alerting & Notifications

  • Amazon SNS was used as the notification mechanism for alarms

  • Email subscriptions were configured to receive:

    • Alerts when utilization crosses critical thresholds

    • Notifications when services return to normal utilization levels

AWS Managed Grafana Integration

  • AWS Managed Grafana workspace was created

  • CloudWatch was added as a data source

  • Dashboards were configured to visualize:

    • ECS service CPU utilization

    • ECS service memory utilization

    • Service-level performance trends

  • Grafana provides real-time and historical visibility into service health

Purpose & Benefits

  • Early detection of performance bottlenecks

  • Clear visibility into ECS service behavior

  • Proactive alerting before user impact

  • Confirmation notifications when systems stabilize

This monitoring setup ensures the platform remains observable, reliable, and operationally ready across staging and production environments.
