End-To-End Documentation
This document provides a consolidated, end-to-end view of the architecture design, infrastructure experiments, and Kubernetes-based deployment for the Helium application. It is structured as a practical engineering reference and runnable runbook.
1. Architecture Review & Design
1.1 Architecture Understanding
Reviewed the provided AWS architecture diagram in detail
Identified major components, service boundaries, and dependencies
Understood data flow between frontend, backend, worker, and data layers
Analyzed interactions between AWS-managed services and external integrations
1.2 Architecture Diagram Creation
Created an updated AWS architecture diagram
Ensured clear separation of:
Frontend layer
Backend API services
Asynchronous worker services
Data and caching components
Represented VPC boundaries, load balancers, and external service integrations
1.3 AWS Services Studied
Amazon VPC and subnet design
ECS Fargate for container-based workloads
Application Load Balancer (ALB)
NAT Gateway for outbound connectivity
IAM roles and permissions
Amazon SQS for asynchronous processing
1.4 Architectural Enhancements Incorporated
AWS Global Accelerator
Considered for improving global traffic routing and reducing latency
Asynchronous Processing with SQS
SQS positioned between Backend API and Worker services
Designed to decouple synchronous API requests from background processing
1.5 Data Flow Design
Frontend → Backend API
Backend API → SQS
Worker services → Cache / external services
2. ECS, Fargate & Core Infrastructure Setup (Failed During Initial Configuration)
2.1 Fargate & ElastiCache Redis Integration
Attempted integration of backend and worker Fargate containers with ElastiCache Redis
Reviewed VPC and subnet requirements for Redis connectivity
Configured security group rules for Redis access
Updated ECS task IAM roles for backend and worker containers
Resolved S3 permission issues related to s3:HeadObject vs s3:GetObject
Reviewed ECS task role, execution role, and IAM policy behavior
2.2 Debugging & IAM Fixes
Investigated issues related to:
Host port visibility in task definitions
Essential container configuration
Missing S3 permissions
Corrected invalid IAM actions causing Invalid Action: s3:HeadObject errors
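The underlying cause is that s3:HeadObject is not a valid IAM action: S3 authorizes HEAD Object requests through the s3:GetObject permission. A corrected policy statement could look like the following sketch (the bucket name mirrors the one created in Phase 3; trim the actions to what the tasks actually need):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::helium-files-sg",
        "arn:aws:s3:::helium-files-sg/*"
      ]
    }
  ]
}
```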
2.3 ElastiCache (Redis / Valkey) Setup
Created an ElastiCache Redis (Valkey) cluster inside the VPC
Restricted port 6379 access to the Fargate task security group
2.4 ECS Services & Networking Fixes
Created backend and worker ECS Fargate services
Fixed subnet configuration for frontend, backend, and worker services
Verified ENI attachment and internal VPC routing
Verified ALB → backend service connectivity
Investigated ECS service rollback failures using:
ECS service events
Task definitions
IAM permissions
Verified IAM access for ECS, ECR, CloudWatch Logs, and VPC operations
2.5 Deployment & Resource Configuration
Continued ECS Fargate deployment troubleshooting
Reviewed CPU and memory configuration at task and container levels
Validated ALB target group behavior and frontend–backend traffic flow
3. Kubernetes (kind) Based Deployment & Execution
3.1 Docker Image Build & Push (Docker Hub)
docker login
docker build -t <dockerhub-username>/helium-frontend:latest ./frontend
docker build -t <dockerhub-username>/helium-backend:latest ./backend
docker push <dockerhub-username>/helium-frontend:latest
docker push <dockerhub-username>/helium-backend:latest
3.2 Kubernetes Tooling Installation
#!/bin/bash
set -e # Exit immediately if a command exits with a non-zero status
echo "Updating package index..."
sudo apt-get update
echo "Installing prerequisites..."
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
echo "Adding Docker's official GPG key..."
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "Setting up Docker stable repository..."
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
echo "Installing Docker..."
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
echo "Starting and enabling Docker..."
sudo systemctl start docker
sudo systemctl enable docker
echo "Adding current user to docker group..."
sudo usermod -aG docker $USER
echo "Please log out and log back in so that group changes take effect."
# ------------------------
echo "Installing kind..."
# Download the kind binary matching the host architecture (x86_64 or arm64)
ARCH=$(uname -m)
if [ "$ARCH" = "x86_64" ]; then
  curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64
elif [ "$ARCH" = "aarch64" ]; then
  curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-arm64
fi
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
echo "Installing kubectl..."
VERSION="v1.30.0"
URL="https://dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl"
INSTALL_DIR="/usr/local/bin"
curl -LO "$URL"
chmod +x kubectl
sudo mv kubectl $INSTALL_DIR/
kubectl version --client
echo "Cleaning up..."
rm -f kubectl
rm -f kind
echo "kind & kubectl installation complete!"
chmod +x install.sh
./install.sh
3.3 Kubernetes Cluster Creation
kind create cluster --config kind-cluster.yaml
kubectl get nodes
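The kind-cluster.yaml file is referenced above but not reproduced. A minimal sketch that would support the rest of this runbook is shown below; the extraPortMappings expose the NodePorts used later (30000/30001) on the host, which the Nginx reverse proxy in section 3.4 expects at 127.0.0.1:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30000   # frontend NodePort
        hostPort: 30000
      - containerPort: 30001   # backend NodePort
        hostPort: 30001
```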
3.3.1 Namespace & Secrets Setup
kubectl create namespace helium
kubectl apply -n helium -f helium-backend-secret.yaml
kubectl apply -n helium -f helium-worker-secret.yaml
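The two secret manifests are applied but not shown. An illustrative helium-backend-secret.yaml might look like the following — all keys and values here are placeholders, not the real configuration:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: helium-backend-secret
  namespace: helium
type: Opaque
stringData:
  REDIS_HOST: "redis.internal.example"        # placeholder
  DATABASE_URL: "postgres://user:pass@host:5432/db"  # placeholder
```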
3.3.2 Application Deployment
kubectl apply -n helium -f helium-app.yaml
kubectl get pods -n helium
kubectl get svc -n helium
helium-app.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: helium
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-frontend
  namespace: helium
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helium-frontend
  template:
    metadata:
      labels:
        app: helium-frontend
    spec:
      containers:
        - name: frontend
          image: <dockerhub-username>/helium-frontend:latest
          ports:
            - containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-backend
  namespace: helium
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helium-backend
  template:
    metadata:
      labels:
        app: helium-backend
    spec:
      containers:
        - name: backend
          image: <dockerhub-username>/helium-backend:latest
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: helium-backend-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-worker
  namespace: helium
spec:
  replicas: 1
  selector:
    matchLabels:
      app: helium-worker
  template:
    metadata:
      labels:
        app: helium-worker
    spec:
      containers:
        - name: worker
          image: <dockerhub-username>/helium-worker:latest
          envFrom:
            - secretRef:
                name: helium-worker-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: helium-frontend
  namespace: helium
spec:
  type: NodePort
  selector:
    app: helium-frontend
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30000
---
apiVersion: v1
kind: Service
metadata:
  name: helium-backend
  namespace: helium
spec:
  type: NodePort
  selector:
    app: helium-backend
  ports:
    - port: 8000
      targetPort: 8000
      nodePort: 30001
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helium-backend-hpa
  namespace: helium
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helium-backend
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helium-worker-hpa
  namespace: helium
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helium-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
3.4 Nginx Reverse Proxy Setup & Configuration
sudo apt update
sudo apt install -y nginx
sudo cp helium.conf /etc/nginx/sites-available/helium
sudo ln -s /etc/nginx/sites-available/helium /etc/nginx/sites-enabled/helium
sudo nginx -t
sudo systemctl restart nginx
helium.conf
server {
    listen 80;
    server_name _;

    # FRONTEND
    location / {
        proxy_pass http://127.0.0.1:30000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # BACKEND API
    location /api/ {
        proxy_pass http://127.0.0.1:30001;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Traffic Flow
Browser → EC2 Public IP (port 80)
/ routed to frontend NodePort 30000
/api/ routed to backend NodePort 30001
3.5 Horizontal Pod Autoscaling
HPA definitions are included directly in helium-app.yaml
Autoscaling is applied automatically during application deployment
3.6 Metrics Server Installation
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl get pods -n kube-system | grep metrics-server
3.7 Scaling Validation
watch kubectl get pods -n helium
Traffic was generated against the application UI to validate backend and worker pod autoscaling behavior
3.8 How to Run & Verify (Quick Flow)
# 1. Build & push images (section 3.1): docker build / docker push
# 2. Install tools (section 3.2): ./install.sh
# 3. Create cluster (section 3.3): kind create cluster --config kind-cluster.yaml
# 4. Deploy application (section 3.3.2): kubectl apply -n helium -f helium-app.yaml
# 5. Configure Nginx (section 3.4): sudo nginx -t && sudo systemctl restart nginx
4. AWS Infrastructure Foundation – Singapore (Detailed, Step-by-Step)
The following section is included verbatim from the original implementation guide. No steps or technical details have been removed.
Phase 1: Infrastructure Foundation (Singapore)
Overview
This phase sets up the base network infrastructure in Singapore (ap-southeast-1) using the AWS Management Console.
Region: ap-southeast-1 (Singapore)
Estimated Time: 45–60 minutes
Prerequisites
AWS Account with administrative access
Logged into AWS Console: https://console.aws.amazon.com
Region switched to Singapore (ap-southeast-1)
Step 1: Virtual Private Cloud (VPC) Setup
1.1 Create VPC
Navigate to VPC service.
Click Create VPC.
VPC settings:
Resources to create: VPC only
Name tag: helium-singapore-vpc
IPv4 CIDR block: 10.0.0.0/16
Tenancy: Default
Click Create VPC.
1.2 Create Subnets
Click Subnets → Create subnet.
Select VPC ID: helium-singapore-vpc.
Public Subnets
helium-public-subnet-1 | AZ: ap-southeast-1a | CIDR: 10.0.1.0/24
helium-public-subnet-2 | AZ: ap-southeast-1b | CIDR: 10.0.2.0/24
Private Subnets
helium-private-subnet-1 | AZ: ap-southeast-1a | CIDR: 10.0.10.0/24
helium-private-subnet-2 | AZ: ap-southeast-1b | CIDR: 10.0.11.0/24
Click Create subnet.
1.3 Internet Gateway (IGW)
Navigate to Internet gateways → Create internet gateway.
Name tag: helium-igw.
Create and attach to VPC helium-singapore-vpc.
1.4 NAT Gateway
Navigate to NAT gateways → Create NAT gateway.
Name: helium-nat-gw.
Subnet: helium-public-subnet-1.
Connectivity type: Public.
Allocate Elastic IP.
Click Create NAT gateway.
1.5 Route Tables
Public Route Table
Name: helium-public-rt
Route: 0.0.0.0/0 → Internet Gateway
Associate public subnets
Private Route Table
Name: helium-private-rt
Route: 0.0.0.0/0 → NAT Gateway
Associate private subnets
Step 2: Security Groups
2.1 Load Balancer Security Group
Name: helium-alb-sg
Inbound rules:
HTTP (80) from 0.0.0.0/0
HTTPS (443) from 0.0.0.0/0
2.2 ECS Task Security Group
Name: helium-ecs-sg
Inbound rules:
TCP 8000 from helium-alb-sg
TCP 3000 from helium-alb-sg
Step 3: ECR Repositories
Navigate to Elastic Container Registry (ECR).
Create private repositories:
helium-backend
Step 4: SSL Certificate (ACM)
Navigate to Certificate Manager.
Request a public certificate.
Domain: he2.site
Additional name: *.he2.site
Validation method: DNS.
Create DNS records in Route 53.
Step 5: Verification Checklist
VPC helium-singapore-vpc exists
Public subnets route to IGW
Private subnets route to NAT Gateway
Security group rules validated
ECR repositories created
ACM certificate issued
Regional Reuse (Virginia & Mumbai)
The exact same steps above are repeated for:
Virginia (us-east-1)
Mumbai (ap-south-1)
Only the following change:
AWS Region
Availability Zones
Resource names (VPC, subnets, ALB, ECS cluster)
No architectural or procedural changes are required.
Phase 2: Backend Deployment (Singapore) – AWS Console Guide
Overview
This phase deploys the backend services to AWS ECS Fargate in Singapore (ap-southeast-1) using the AWS Management Console.
Region: ap-southeast-1
Estimated Time: 60–90 minutes
Prerequisites
Phase 1 completed
Docker installed locally (required for building images)
AWS CLI installed locally (required for authentication)
Step 1: Push Docker Images (Local Terminal)
Docker images must be built locally. This step cannot be performed from the AWS Console.
Open your local terminal (VS Code / Terminal).
Login to Amazon ECR (Singapore region):
aws ecr get-login-password --region ap-southeast-1 | docker login --username AWS --password-stdin <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com
Build the backend image (ARM64 architecture):
docker build --platform linux/arm64 -t helium-backend:latest ./backend
Tag the image:
docker tag helium-backend:latest <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com/helium-backend:latest
Push the image to ECR:
docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com/helium-backend:latest
Step 2: Secrets Management
Navigate to AWS Secrets Manager (Singapore region).
Click Store a new secret.
Secret type: Other type of secret.
Add the required key/value pairs.
Name the secret: helium/backend/production.
Store the secret and copy the Secret ARN.
Step 3: IAM Roles
3.1 Create ECS Task Execution Role
Trusted entity: Elastic Container Service Task
Attach policies:
AmazonECSTaskExecutionRolePolicy
AmazonEC2ContainerRegistryReadOnly
Role name: helium-ecs-execution-role-sg
3.2 Add Secrets Manager Permission
Attach an inline policy allowing secretsmanager:GetSecretValue for the backend secret ARN.
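A sketch of that inline policy is shown below (the account ID placeholder matches the one used elsewhere in this guide; Secrets Manager ARNs end with a random suffix, hence the trailing -*):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:ap-southeast-1:<YOUR_AWS_ACCOUNT_ID>:secret:helium/backend/production-*"
    }
  ]
}
```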
Step 4: ECS Cluster Creation
Cluster name: helium-singapore-cluster
Infrastructure type: Fargate (serverless)
Step 5: Task Definition
Task definition family: helium-backend-sg
Launch type: Fargate
OS / Architecture: Linux / ARM64
Task size:
CPU: 0.5 vCPU
Memory: 1 GB
Container configuration:
Name: backend
Image: Backend ECR image
Port: 8000
Secrets injected from AWS Secrets Manager
CloudWatch logging enabled
Step 6: Application Load Balancer (ALB)
Load balancer type: Application Load Balancer
Scheme: Internet-facing
Name: helium-sg-alb
Listener: HTTPS (443) with ACM certificate
Target group:
Type: IP
Port: 8000
Health check path:
/api/health
Step 7: Deploy ECS Service
Service name: helium-backend-service
Desired tasks: 2
VPC: helium-singapore-vpc
Subnets: Private subnets only
Public IP: Disabled
Load balancing configuration:
Load balancer: helium-sg-alb
Listener: HTTPS (443)
Target group: helium-backend-tg-sg
Container: backend:8000
Verification
ECS service reaches Steady state
Target group shows healthy targets
https://<ALB-DNS-NAME>/api/health returns HTTP 200
Optional Route 53 record: api-sg.he2.site
Phase 3: Data & Cache Layer (Singapore) – AWS Console Guide
Overview
This phase creates a Redis (ElastiCache) cluster in Singapore for low-latency caching and sets up an S3 bucket for object storage.
Region: ap-southeast-1
Estimated Time: 30–45 minutes
Step 1: Redis (ElastiCache)
Navigate to ElastiCache in the AWS Management Console.
1.1 Create Subnet Group
Click Subnet groups → Create subnet group.
Name: helium-redis-subnet-group-sg.
Description: Singapore Redis Subnets.
VPC: helium-singapore-vpc.
Add subnets: Select both private subnets.
Click Create.
1.2 Create Security Group (EC2)
Navigate to EC2 → Security Groups → Create security group.
Name: helium-redis-sg.
VPC: helium-singapore-vpc.
Inbound rule:
Type: Custom TCP
Port: 6379
Source: helium-ecs-sg
Click Create.
1.3 Create Redis Cluster
Navigate to ElastiCache → Redis caches → Create Redis cache.
Deployment option: Design your own cache.
Creation method: Cluster.
Name: helium-redis-sg.
Cluster mode: Disabled (enabled optional for scaling).
Node type: cache.t3.micro.
Number of replicas: 1.
Subnet group: helium-redis-subnet-group-sg.
Security group: helium-redis-sg.
Enable encryption at rest and in transit.
Click Create.
1.4 Update Secrets Manager
Wait until Redis status is Available.
Copy the primary endpoint.
Navigate to Secrets Manager → helium/backend/production.
Edit and update REDIS_HOST.
Save changes.
Force new deployment of backend ECS service.
Step 2: S3 Storage (Singapore)
Navigate to Amazon S3.
Click Create bucket.
Bucket name: helium-files-sg.
Region: ap-southeast-1.
Keep Block all public access enabled.
Enable bucket versioning.
Enable server-side encryption (SSE-S3).
Click Create bucket.
2.1 Configure CORS
Open the bucket → Permissions tab.
Edit CORS configuration and paste the JSON policy.
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["GET", "PUT", "POST", "DELETE", "HEAD"],
    "AllowedOrigins": ["https://he2.site", "https://www.he2.site"],
    "ExposeHeaders": ["ETag"],
    "MaxAgeSeconds": 3000
  }
]
Save changes.
Phase 4: Agent Execution (Singapore) – AWS Console Guide
Overview
This phase configures the background worker service in Singapore to process long-running agent and asynchronous tasks using ECS Fargate.
Region: ap-southeast-1
Estimated Time: 20–30 minutes
Prerequisites
Backend ECS service successfully deployed
Redis (ElastiCache) cluster in Available state
Step 1: Create Worker ECS Service
The worker service reuses the same backend Docker image but runs with a different container command.
Navigate to ECS → Clusters → helium-singapore-cluster.
Open the Services tab and click Create.
Compute options: FARGATE.
Task definition family: helium-backend-sg.
Service name: helium-worker-service.
Desired tasks: 2.
Networking configuration:
VPC: helium-singapore-vpc
Subnets: Private subnets
Security group: helium-ecs-sg
Public IP: Disabled
Container overrides:
Container: backend
Command: python, run_worker.py
Click Create.
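In task-definition JSON terms, that override corresponds to a container command of the following shape (an abbreviated sketch; the remaining container fields stay as in the backend task definition):

```json
{
  "containerDefinitions": [
    {
      "name": "backend",
      "command": ["python", "run_worker.py"]
    }
  ]
}
```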
Step 2: Service Auto Scaling (Optional)
Open helium-worker-service.
Update the service and enable auto scaling.
Minimum tasks: 1
Maximum tasks: 10
Scaling policy: Target tracking
Metric: ECSServiceAverageCPUUtilization
Target value: 70
Regional Reuse (Virginia & Mumbai)
The same Agent Execution (Phase 4) steps can be repeated for:
Virginia (us-east-1)
Mumbai (ap-south-1)
Only the following change:
AWS Region
ECS cluster name
Subnets and security groups (region-specific)
No architectural or procedural changes are required.
5. AWS Global Accelerator & Route 53 Integration
1. Purpose of This Document
This section describes the end-to-end setup of AWS Global Accelerator (GA) integrated with multi-region Application Load Balancers (ALBs) and Amazon Route 53. It covers architecture, configuration steps, routing behavior, traffic flow, and best practices from a Cloud Engineer perspective.
2. What is AWS Global Accelerator?
AWS Global Accelerator is a global networking service that improves application availability and performance by directing user traffic to the nearest healthy AWS Region using Anycast static IPs.
Key Characteristics
Global service (not region-specific)
Provides two static Anycast IP addresses
Routes traffic at AWS Edge locations
Supports multi-region endpoints (ALB, NLB, EC2, Elastic IP)
Offers near-instant regional failover
Important: Global Accelerator is managed from US West (Oregon). This is only the control plane location and does not mean that application traffic flows through Oregon.
3. Architecture Overview
User → Route 53 (DNS) → Global Accelerator (Anycast IP) → Nearest AWS Edge Location → Closest Healthy Region → Regional ALB → ECS Services
4. Why Use a Single Global Accelerator for Multiple Regions
Recommended Design
One Global Accelerator per application
Multiple endpoint groups (one per AWS Region)
Each endpoint group contains the regional ALB
Benefits
Automatic latency-based routing
Fast and seamless regional failover
Simple and clean DNS configuration
Lower operational overhead and cost
What NOT to Do
Do not create multiple Global Accelerators for the same application
Do not use Route 53 latency routing with Global Accelerator
5. Global Accelerator Configuration Steps
Step 1: Create the Global Accelerator
Create a new Global Accelerator
Listener protocol: TCP
Listener ports: 80 / 443 (based on ALB configuration)
Save the GA DNS name and static IP addresses
Step 2: Create Endpoint Groups (One Per Region)
Example regions:
ap-south-1 (Mumbai)
ap-southeast-1 (Singapore)
us-east-1 (Virginia)
For each endpoint group:
Set traffic dial to 100%
Use default health check settings
Step 3: Add ALBs to Endpoint Groups
For each region:
Endpoint type: Application Load Balancer
Select the ALB from the same region
Keep default weight unless traffic tuning is required
6. Security Group Considerations
Global Accelerator itself does not have a security group
Traffic reaches the ALB from AWS Edge locations
ALB Security Group must allow inbound HTTP/HTTPS traffic from either:
0.0.0.0/0 (simpler, less restrictive)
AWS-managed Global Accelerator prefix list (recommended)
7. Route 53 Configuration (DNS Setup)
Correct Record Setup
Record name: api
Record type: A (IPv4)
Alias: Yes
Route traffic to: Alias to Global Accelerator
Routing policy: Simple routing
Evaluate target health: Yes
Health check: Not required
Important Notes
Route 53 always shows US West (Oregon) for Global Accelerator
This is expected behavior
Do not select Latency, Geo, or Weighted routing
8. Why Simple Routing is Mandatory
| Layer | Responsibility |
|---|---|
| Route 53 | DNS resolution only |
| Global Accelerator | Latency routing & failover |
| ALB | Application load balancing |
9. Traffic Flow Explanation
User resolves api.domain.com
Route 53 returns the Global Accelerator IP
User connects to nearest AWS Edge location
Global Accelerator selects the closest healthy region
Traffic forwards to the regional ALB
ALB routes traffic to ECS service tasks
10. High Availability & Failover
Endpoint health is continuously monitored
If a region becomes unhealthy, traffic is instantly routed to the next healthy region
No DNS TTL or propagation delay
11. AWS Global Accelerator – Traffic Capacity
Global Accelerator
Designed to handle millions of requests per second
Built on the AWS global edge network using Anycast IPs
Scales automatically with traffic spikes
Application Load Balancer (per region – approximate soft limits)
~100,000+ requests per second
~3,000 new connections per second
~100,000 active connections
Actual capacity depends on instance size, target type, and request patterns.
12. Conclusion
This architecture provides:
Global performance optimization
High availability and fast failover
Simple and reliable DNS management
6. AWS X-Ray Integration (ECS – Infrastructure Level)
1. Purpose
This section documents the AWS X-Ray integration for ECS-based microservices from an infrastructure and Cloud Engineer perspective. Application-level instrumentation is intentionally excluded and handled by development teams.
2. What is AWS X-Ray
AWS X-Ray is a distributed tracing service used to analyze, debug, and monitor applications by tracking requests as they traverse AWS services.
Key Capabilities
End-to-end request tracing across services
Visual service maps showing dependencies
Latency and error analysis
Identification of faults, throttling, and failures
Native integration with Amazon CloudWatch
3. When to Use AWS X-Ray
AWS X-Ray is especially useful when:
Applications use a microservices architecture
Requests span multiple AWS services or regions
Latency issues require root-cause analysis
Precise failure points must be identified
4. High-Level Architecture Flow
Client Request → Application Load Balancer (ALB) → ECS Service (Application Container) → X-Ray Daemon (Sidecar Container) → AWS X-Ray Service → Service Maps & Traces
5. Cloud Engineer Responsibilities
Provision and validate X-Ray infrastructure
Add X-Ray daemon as a sidecar container
Configure required IAM permissions
Ensure compatibility with AWS-managed services
Integrate observability with CloudWatch
6. ECS X-Ray Setup (Infrastructure Side)
6.1 Prerequisites
ECS services running behind an Application Load Balancer
Tasks have outbound internet or NAT access
X-Ray supported in the selected AWS region
No account-level enablement is required.
6.2 X-Ray and Application Load Balancer
No manual "Enable X-Ray" option exists on ALB
ALB automatically injects the X-Amzn-Trace-Id header
ALB segments appear only after backend instrumentation
No ALB configuration changes are required.
6.3 ECS Task Definition – X-Ray Daemon (Sidecar)
For each ECS service (backend, worker):
Add X-Ray daemon as a sidecar container
Container image:
public.ecr.aws/xray/aws-xray-daemon:latest
Expose port 2000/UDP
Use the same network mode as the application container
Runs within the same ECS task
Responsibilities of the X-Ray Daemon
Receives trace data from application containers
Forwards trace data to AWS X-Ray
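As a sketch, the sidecar entry in the task definition's containerDefinitions could look like this (the CPU/memory values are illustrative assumptions, not values from the original setup):

```json
{
  "name": "helium-backend-xray",
  "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
  "essential": false,
  "cpu": 32,
  "memoryReservation": 256,
  "portMappings": [
    { "containerPort": 2000, "protocol": "udp" }
  ]
}
```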
6.4 Environment Variables Configuration
Environment variables must be added to both the application container and the X-Ray daemon container.
Backend
X-Ray daemon container name: helium-backend-xray
Key: AWS_XRAY_DAEMON_ADDRESS
Value: helium-backend-xray:2000
Worker
X-Ray daemon container name: helium-worker-xray
Key: AWS_XRAY_DAEMON_ADDRESS
Value: helium-worker-xray:2000
Notes
Port 2000/UDP must be exposed
Container names act as DNS hostnames inside ECS tasks
6.5 IAM Configuration (Task Role)
Attach the managed policy below to the ECS Task Role:
AWSXrayWriteOnlyAccess
This allows:
Sending trace segments and subsegments
Communication with the X-Ray service
The task execution role does not require X-Ray permissions.
6.6 Deploy Updated ECS Services
Register the updated task definition
Deploy changes to backend and worker services
Confirm X-Ray daemon container is in RUNNING state
6.7 Validate X-Ray Daemon Logs
Expected log messages:
Successful initialization
Region detection confirmation
Get instance id metadata failed warnings (expected in ECS/Fargate)
6.8 Expected State Before Application Instrumentation
No service map visible
No traces in X-Ray console
This is expected until application code is instrumented and traffic flows.
7. Developer Responsibilities (Reference)
Add AWS X-Ray SDKs (Java, Node.js, Python, etc.)
Enable tracing in application code
Use meaningful service and subsegment names
8. Understanding Service Maps
What Service Maps Show
Connected services
Request flow paths
Error and fault locations
Latency between components
How Maps Are Generated
Applications send trace data
X-Ray builds maps automatically
Maps update dynamically with traffic
9. Viewing Traces and Errors
AWS Console → X-Ray → Traces
AWS Console → X-Ray → Service Map
You can analyze:
4xx / 5xx errors
Slow requests
Latency breakdown per service
Exact failure points
10. Common Issues and Observations
Daemon Running but No Traces
Application not instrumented
No incoming traffic
Wrong AWS region selected
IMDS Errors in Logs
Expected in ECS/Fargate
Safe to ignore
Do not impact trace collection
11. Integration with Amazon CloudWatch
X-Ray integrates natively with CloudWatch
Metrics can be used for alarms
Logs, metrics, and traces can be correlated
CloudWatch = metrics & logs
X-Ray = request-level visibility
12. Best Practices
Always run X-Ray daemon as a sidecar
Use clear and consistent service names
Combine X-Ray with CloudWatch alarms
Enable tracing early in lower environments
13. X-Ray Daemon Container Image
Recommended image:
public.ecr.aws/xray/aws-xray-daemon:latest
Official AWS-maintained image
Actively updated
No Docker Hub rate limits
Optimized for ECS, EKS, and Fargate
14. Conclusion
AWS X-Ray enables deep visibility into distributed systems. By preparing infrastructure in advance, Cloud Engineers allow development teams to activate tracing seamlessly when application changes are introduced.
7. Staging Deployment – ECS Fargate (Isolated Environment)
Overview
This section documents the staging environment deployment of the Helium application using AWS ECS Fargate. The staging setup mirrors production architecture while remaining fully isolated to safely validate changes before promotion.
The staging environment was created with separate clusters, services, task definitions, load balancer, SSL certificate, and Global Accelerator.
Staging Resources Created
ECS & Compute
ECS Cluster: helium-backend-staging-cluster
Backend ECS Service: helium-backend-staging-service
Worker ECS Service: helium-worker-staging-service
Backend Task Definition: helium-backend-staging
Worker Task Definition: helium-worker-staging
Networking & Security
Dedicated Application Load Balancer for staging
Separate Target Groups for backend staging service
Certificates & DNS
Separate ACM certificate created for staging domain
Certificate attached exclusively to the staging ALB
Global Traffic Management
Separate AWS Global Accelerator created for staging
Staging ALB registered as an endpoint in the staging Global Accelerator
Isolation maintained between production and staging traffic paths
Staging Architecture Characteristics
Complete isolation from production resources
Same ECS Fargate launch type
Independent deployment lifecycle
Safe environment for testing infrastructure, scaling, and application changes
Deployment Flow (Staging)
Build and push Docker images (same images reused or staging-tagged as required)
Register staging task definitions (helium-backend-staging, helium-worker-staging)
Deploy backend ECS service in helium-backend-staging-cluster
Deploy worker ECS service with background execution command
Attach staging services to the staging ALB and target groups
Validate health checks and service steady state
Register staging ALB with the staging Global Accelerator
Validate end-to-end traffic flow through Global Accelerator → ALB → ECS services
Purpose of Staging Environment
Validate ECS task definition changes
Test scaling behavior and resource limits
Verify Global Accelerator and ALB routing
Validate certificates and HTTPS termination
Perform safe functional and performance testing before production release
8. CI/CD – ECS Fargate Multi-Region Deployment
This section documents the CI/CD pipeline used to build, push, and deploy Helium backend and worker services across staging and production environments using GitHub Actions and Amazon ECS.
The pipeline supports:
Branch-based deployments
Multi-region Docker image builds
Safe staging deployments
Controlled production rollouts
CI/CD Workflow Definition
name: CICD
# =====================================================
# ๐ TRIGGERS
# =====================================================
# - beta → STAGING deploy
# - migrated-demo2 → PRODUCTION deploy
# =====================================================
on:
push:
branches:
- beta
- migrated-demo2
workflow_dispatch:
# =====================================================
# ๐ GLOBAL VARIABLES
# =====================================================
env:
IMAGE_TAG: v${{ github.run_number }}
ECR_REPO: helium-backend
# =====================================================
# ๐️ BUILD & PUSH DOCKER IMAGES
# =====================================================
# - Builds ARM64 image
# - Pushes to ECR in ALL required regions
# - Runs for BOTH staging and production branches
# =====================================================
jobs:
  build:
    name: Build & Push Docker Images
    runs-on: ubuntu-latest
    strategy:
      matrix:
        region:
          - ap-south-1      # Mumbai
          - ap-southeast-1  # Singapore
          - us-east-1       # Virginia
    # Each matrix leg sets only its own region's output key,
    # so all three image URIs are populated across the legs.
    outputs:
      image_ap: ${{ steps.export.outputs.image_ap }}
      image_sg: ${{ steps.export.outputs.image_sg }}
      image_us: ${{ steps.export.outputs.image_us }}
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Setup Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ matrix.region }}

      - name: Login to Amazon ECR
        id: login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build & Push Image
        run: |
          IMAGE_URI=${{ steps.login.outputs.registry }}/${{ env.ECR_REPO }}:${{ env.IMAGE_TAG }}
          docker buildx build \
            --platform linux/arm64 \
            -t $IMAGE_URI \
            --push \
            -f backend/Dockerfile backend
          # Persist the URI so the next step in this leg can read it
          echo "IMAGE_URI=$IMAGE_URI" >> $GITHUB_ENV

      - name: Export Image Output
        id: export
        run: |
          if [ "${{ matrix.region }}" = "ap-south-1" ]; then
            echo "image_ap=$IMAGE_URI" >> $GITHUB_OUTPUT
          elif [ "${{ matrix.region }}" = "ap-southeast-1" ]; then
            echo "image_sg=$IMAGE_URI" >> $GITHUB_OUTPUT
          else
            echo "image_us=$IMAGE_URI" >> $GITHUB_OUTPUT
          fi
  # =====================================================
  # DEPLOY STAGING (Mumbai ONLY)
  # =====================================================
  deploy-staging:
    name: Deploy STAGING (Mumbai)
    runs-on: ubuntu-latest
    needs: build
    if: github.ref_name == 'beta'
    strategy:
      matrix:
        include:
          - task_type: backend
            container: helium-backend-staging
            td: helium-backend-staging
            svc: helium-backend-staging-service
            cluster: helium-mumbai-staging-cluster
          - task_type: worker
            container: helium-worker-staging
            td: helium-worker-staging
            svc: helium-worker-staging-service
            cluster: helium-mumbai-staging-cluster
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ap-south-1

      - name: Download Task Definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ matrix.td }} \
            --query taskDefinition \
            | jq 'del(.taskDefinitionArn,.revision,.status,.requiresAttributes,.compatibilities,.registeredAt,.registeredBy)' > task.json

      - name: Render Task Definition
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        id: render
        with:
          task-definition: task.json
          container-name: ${{ matrix.container }}
          image: ${{ needs.build.outputs.image_ap }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ matrix.svc }}
          cluster: ${{ matrix.cluster }}
          wait-for-service-stability: true
  # =====================================================
  # DEPLOY PRODUCTION (MULTI-REGION)
  # =====================================================
  deploy-production:
    name: Deploy PRODUCTION
    runs-on: ubuntu-latest
    needs: build
    if: github.ref_name == 'migrated-demo2'
    strategy:
      matrix:
        include:
          - region: ap-south-1
            image: ${{ needs.build.outputs.image_ap }}
            cluster: helium-mumbai-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service
          - region: ap-south-1
            image: ${{ needs.build.outputs.image_ap }}
            cluster: helium-mumbai-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service
          - region: ap-southeast-1
            image: ${{ needs.build.outputs.image_sg }}
            cluster: helium-singapore-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service
          - region: ap-southeast-1
            image: ${{ needs.build.outputs.image_sg }}
            cluster: helium-singapore-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service
          - region: us-east-1
            image: ${{ needs.build.outputs.image_us }}
            cluster: helium-virginia-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service
          - region: us-east-1
            image: ${{ needs.build.outputs.image_us }}
            cluster: helium-virginia-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ matrix.region }}

      - name: Download Task Definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ matrix.td }} \
            --query taskDefinition \
            | jq 'del(.taskDefinitionArn,.revision,.status,.requiresAttributes,.compatibilities,.registeredAt,.registeredBy)' > task.json

      - name: Render Task Definition
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        id: render
        with:
          task-definition: task.json
          container-name: ${{ matrix.container }}
          image: ${{ matrix.image }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ matrix.svc }}
          cluster: ${{ matrix.cluster }}
          wait-for-service-stability: true
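The Download and Render steps above strip the read-only fields from a DescribeTaskDefinition response and swap in the freshly built image before re-registering the task definition. A minimal Python sketch of that logic (the sample task definition and resource names below are illustrative, not the real ones; the field names match the ECS API):

```python
import copy
import json

# Read-only fields returned by DescribeTaskDefinition that must be removed
# before the JSON can be re-registered (mirrors the pipeline's jq `del(...)`).
READ_ONLY_FIELDS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy",
}

def render_task_definition(task_def, container_name, image):
    """Return a registerable copy of task_def with the named container's image swapped."""
    rendered = copy.deepcopy(
        {k: v for k, v in task_def.items() if k not in READ_ONLY_FIELDS}
    )
    for container in rendered["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
    return rendered

# Hypothetical sample mimicking a DescribeTaskDefinition payload.
sample = {
    "taskDefinitionArn": "arn:aws:ecs:ap-south-1:123456789012:task-definition/helium-backend-staging:41",
    "revision": 41,
    "status": "ACTIVE",
    "family": "helium-backend-staging",
    "containerDefinitions": [
        {"name": "helium-backend-staging", "image": "old-image:v41"},
    ],
}

new = render_task_definition(sample, "helium-backend-staging", "new-image:v42")
print(json.dumps(new, indent=2))
```

In the real pipeline this work is done by `jq` plus `aws-actions/amazon-ecs-render-task-definition`; the sketch only makes the transformation explicit.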
9. Staging Deployment & CI/CD Pipeline
This section consolidates the Staging Deployment & CI/CD Pipeline documentation and aligns with the staging and CI/CD sections described above.
Scope Covered
Isolated staging environment using ECS Fargate
Separate staging ECS clusters, services, and task definitions
Dedicated staging Application Load Balancer and ACM certificate
Independent AWS Global Accelerator for staging traffic
GitHub Actions based CI/CD pipeline supporting:
Branch-based deployments (beta → staging, migrated-demo2 → production)
Multi-region Docker image builds
Safe, controlled rollouts to ECS services
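The branch gating above can be expressed as a simple lookup; a sketch mirroring the workflow's `if:` conditions (the function name is ours, not part of the pipeline):

```python
def target_environment(branch):
    """Map a pushed branch to its deploy target, mirroring the workflow's `if:` gates."""
    mapping = {
        "beta": "staging",               # deploy-staging job
        "migrated-demo2": "production",  # deploy-production job
    }
    return mapping.get(branch)  # any other branch triggers no deploy

print(target_environment("beta"))  # staging
```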
Key Staging Resources
ECS Cluster: helium-backend-staging-cluster
Backend Service: helium-backend-staging-service
Worker Service: helium-worker-staging-service
Task Definitions: helium-backend-staging, helium-worker-staging
Dedicated ALB, target groups, ACM certificate, and Global Accelerator
CI/CD Overview
Docker images are built once per commit (ARM64)
Images are pushed to ECR in Mumbai, Singapore, and Virginia
Staging deploys are automatically triggered from the beta branch
Production deploys are gated and triggered from migrated-demo2
ECS task definitions are rendered dynamically and deployed with service-stability checks
This completes the Day‑1 architecture documentation by extending it through staging isolation and automated delivery pipelines, ensuring a full lifecycle view from design to deployment.
10. Monitoring & Alerting (AWS Managed Grafana & CloudWatch)
This section documents the monitoring and alerting setup implemented using Amazon CloudWatch and AWS Managed Grafana to ensure visibility into system health and proactive notifications.
Overview
Monitoring was configured for ECS Fargate services (backend and worker) across environments to track CPU and memory utilization, visualize metrics, and trigger alerts based on defined thresholds.
CloudWatch Metrics & Alarms
The following CloudWatch alarms were created for all ECS services:
High CPU Utilization Alarm
Metric: CPUUtilization
Threshold: Greater than 70%
Evaluation: Sustained breach over configured periods
Action: Trigger alarm notification
High Memory Utilization Alarm
Metric: MemoryUtilization
Threshold: Greater than 70%
Evaluation: Sustained breach over configured periods
Action: Trigger alarm notification
Normal Utilization (Recovery) Alarm
Metric: CPU and Memory Utilization
Threshold: Below 60%
Purpose: Indicate service has returned to a normal operating range
Action: Send notification email indicating system stability
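As a sketch, the high-CPU alarm described above could be created through the CloudWatch PutMetricAlarm API; the keyword arguments below follow that API, while the alarm, cluster, service, and SNS topic names are hypothetical placeholders:

```python
# Alarm definition mirroring the thresholds above; resource names are
# illustrative, not the actual Helium resources.
high_cpu_alarm = {
    "AlarmName": "helium-backend-high-cpu",
    "Namespace": "AWS/ECS",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Dimensions": [
        {"Name": "ClusterName", "Value": "helium-mumbai-cluster"},
        {"Name": "ServiceName", "Value": "helium-backend-service"},
    ],
    "Period": 300,            # 5-minute evaluation window
    "EvaluationPeriods": 2,   # sustained breach, not a single spike
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
    # Hypothetical SNS topic ARN for email notifications
    "AlarmActions": ["arn:aws:sns:ap-south-1:123456789012:helium-alerts"],
}

# With AWS credentials configured, this would register the alarm:
# import boto3
# boto3.client("cloudwatch", region_name="ap-south-1").put_metric_alarm(**high_cpu_alarm)
```

The memory alarm is identical except for `MetricName: "MemoryUtilization"`, and the recovery alarm flips the comparison to `LessThanThreshold` at 60%.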
Alerting & Notifications
Amazon SNS was used as the notification mechanism for alarms
Email subscriptions were configured to receive:
Alerts when utilization crosses critical thresholds
Notifications when services return to normal utilization levels
AWS Managed Grafana Integration
AWS Managed Grafana workspace was created
CloudWatch was added as a data source
Dashboards were configured to visualize:
ECS service CPU utilization
ECS service memory utilization
Service-level performance trends
Grafana provides real-time and historical visibility into service health
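For illustration, a single CloudWatch query target behind the ECS CPU panel can be sketched as a dict roughly matching Grafana's CloudWatch data-source query model (field names are an approximation of that model, and the cluster/service values are placeholders):

```python
# Approximate shape of a Grafana CloudWatch panel query for ECS CPU;
# the exact JSON schema depends on the Grafana version in use.
cpu_panel_query = {
    "region": "ap-south-1",
    "namespace": "AWS/ECS",
    "metricName": "CPUUtilization",
    "statistic": "Average",
    "period": "300",
    "dimensions": {
        "ClusterName": "helium-mumbai-cluster",   # placeholder
        "ServiceName": "helium-backend-service",  # placeholder
    },
}

print(cpu_panel_query["namespace"], cpu_panel_query["metricName"])
```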
Purpose & Benefits
Early detection of performance bottlenecks
Clear visibility into ECS service behavior
Proactive alerting before user impact
Confirmation notifications when systems stabilize
This monitoring setup ensures the platform remains observable, reliable, and operationally ready across staging and production environments.