End-To-End Documentation
This document provides a consolidated, end-to-end view of the architecture design, infrastructure experiments, and Kubernetes-based deployment for the Helium application. It is structured as a practical engineering reference and runnable runbook.
1. Architecture Review & Design
1.1 Architecture Understanding
Reviewed the provided AWS architecture diagram in detail
Identified major components, service boundaries, and dependencies
Understood data flow between frontend, backend, worker, and data layers
Analyzed interactions between AWS-managed services and external integrations
1.2 Architecture Diagram Creation
Created an updated AWS architecture diagram
Ensured clear separation of:
Frontend layer
Backend API services
Asynchronous worker services
Data and caching components
Represented VPC boundaries, load balancers, and external service integrations
1.3 AWS Services Studied
Amazon VPC and subnet design
ECS Fargate for container-based workloads
Application Load Balancer (ALB)
NAT Gateway for outbound connectivity
IAM roles and permissions
Amazon SQS for asynchronous processing
1.4 Architectural Enhancements Incorporated
AWS Global Accelerator
Considered for improving global traffic routing and reducing latency
Asynchronous Processing with SQS
SQS positioned between Backend API and Worker services
Designed to decouple synchronous API requests from background processing
1.5 Data Flow Design
Frontend → Backend API
Backend API → SQS
Worker services → Cache / external services
2. ECS, Fargate & Core Infrastructure Setup (Failed During Initial Configuration)
2.1 Fargate & ElastiCache Redis Integration
Attempted integration of backend and worker Fargate containers with ElastiCache Redis
Reviewed VPC and subnet requirements for Redis connectivity
Configured security group rules for Redis access
Updated ECS task IAM roles for backend and worker containers
Resolved S3 permission issues related to s3:HeadObject vs s3:GetObject
Reviewed ECS task role, execution role, and IAM policy behavior
2.2 Debugging & IAM Fixes
Investigated issues related to:
Host port visibility in task definitions
Essential container configuration
Missing S3 permissions
Corrected invalid IAM actions causing Invalid Action: s3:HeadObject errors
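The underlying cause is that s3:HeadObject is not a valid IAM action: S3 authorizes HEAD Object requests through the s3:GetObject permission. A corrected policy statement could look like the following sketch (the bucket name mirrors the one created in Phase 3; trim the actions to what the tasks actually need):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::helium-files-sg",
        "arn:aws:s3:::helium-files-sg/*"
      ]
    }
  ]
}
```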
2.3 ElastiCache (Redis / Valkey) Setup
Created an ElastiCache Redis (Valkey) cluster inside the VPC
Restricted port 6379 access to the Fargate task security group
2.4 ECS Services & Networking Fixes
Created backend and worker ECS Fargate services
Fixed subnet configuration for frontend, backend, and worker services
Verified ENI attachment and internal VPC routing
Verified ALB → backend service connectivity
Investigated ECS service rollback failures using:
ECS service events
Task definitions
IAM permissions
Verified IAM access for ECS, ECR, CloudWatch Logs, and VPC operations
2.5 Deployment & Resource Configuration
Continued ECS Fargate deployment troubleshooting
Reviewed CPU and memory configuration at task and container levels
Validated ALB target group behavior and frontend–backend traffic flow
3. Kubernetes (kind) Based Deployment & Execution
3.1 Docker Image Build & Push (Docker Hub)
docker login
docker build -t <dockerhub-username>/helium-frontend:latest ./frontend
docker build -t <dockerhub-username>/helium-backend:latest ./backend
docker push <dockerhub-username>/helium-frontend:latest
docker push <dockerhub-username>/helium-backend:latest
3.2 Kubernetes Tooling Installation
#!/bin/bash
set -e # Exit immediately if a command exits with a non-zero status
echo "Updating package index..."
sudo apt-get update
echo "Installing prerequisites..."
sudo apt-get install -y apt-transport-https ca-certificates curl software-properties-common
echo "Adding Docker's official GPG key..."
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg
echo "Setting up Docker stable repository..."
echo \
"deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/docker-archive-keyring.gpg] https://download.docker.com/linux/ubuntu \
$(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
echo "Installing Docker..."
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
echo "Starting and enabling Docker..."
sudo systemctl start docker
sudo systemctl enable docker
echo "Adding current user to docker group..."
sudo usermod -aG docker $USER
echo "Please log out and log back in so that group changes take effect."
# ------------------------
echo "Installing kind..."
# Download the kind binary matching the host architecture (x86_64 or arm64)
ARCH=$(uname -m)
if [ "$ARCH" = "x86_64" ]; then
  curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-amd64
elif [ "$ARCH" = "aarch64" ]; then
  curl -Lo ./kind https://kind.sigs.k8s.io/dl/v0.27.0/kind-linux-arm64
fi
chmod +x ./kind
sudo mv ./kind /usr/local/bin/kind
echo "Installing kubectl..."
VERSION="v1.30.0"
URL="https://dl.k8s.io/release/${VERSION}/bin/linux/amd64/kubectl"
INSTALL_DIR="/usr/local/bin"
curl -LO "$URL"
chmod +x kubectl
sudo mv kubectl $INSTALL_DIR/
kubectl version --client
echo "Cleaning up..."
rm -f kubectl
rm -f kind
echo "kind & kubectl installation complete!"
chmod +x install.sh
./install.sh
3.3 Kubernetes Cluster Creation
kind create cluster --config kind-cluster.yaml
kubectl get nodes
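The kind-cluster.yaml file is referenced above but not reproduced. A minimal sketch that would support the rest of this runbook is shown below; the extraPortMappings expose the NodePorts used later (30000/30001) on the host, which the Nginx reverse proxy in section 3.4 expects at 127.0.0.1:

```yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraPortMappings:
      - containerPort: 30000   # frontend NodePort
        hostPort: 30000
      - containerPort: 30001   # backend NodePort
        hostPort: 30001
```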
3.3.1 Namespace & Secrets Setup
kubectl create namespace helium
kubectl apply -n helium -f helium-backend-secret.yaml
kubectl apply -n helium -f helium-worker-secret.yaml
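The two secret manifests are applied but not shown. An illustrative helium-backend-secret.yaml might look like the following — all keys and values here are placeholders, not the real configuration:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: helium-backend-secret
  namespace: helium
type: Opaque
stringData:
  REDIS_HOST: "redis.internal.example"        # placeholder
  DATABASE_URL: "postgres://user:pass@host:5432/db"  # placeholder
```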
3.3.2 Application Deployment
kubectl apply -n helium -f helium-app.yaml
kubectl get pods -n helium
kubectl get svc -n helium
helium-app.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: helium
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-frontend
  namespace: helium
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helium-frontend
  template:
    metadata:
      labels:
        app: helium-frontend
    spec:
      containers:
        - name: frontend
          image: <dockerhub-username>/helium-frontend:latest
          ports:
            - containerPort: 80
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-backend
  namespace: helium
spec:
  replicas: 2
  selector:
    matchLabels:
      app: helium-backend
  template:
    metadata:
      labels:
        app: helium-backend
    spec:
      containers:
        - name: backend
          image: <dockerhub-username>/helium-backend:latest
          ports:
            - containerPort: 8000
          envFrom:
            - secretRef:
                name: helium-backend-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: helium-worker
  namespace: helium
spec:
  replicas: 1
  selector:
    matchLabels:
      app: helium-worker
  template:
    metadata:
      labels:
        app: helium-worker
    spec:
      containers:
        - name: worker
          image: <dockerhub-username>/helium-worker:latest
          envFrom:
            - secretRef:
                name: helium-worker-secret
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "500m"
              memory: "1Gi"
---
apiVersion: v1
kind: Service
metadata:
  name: helium-frontend
  namespace: helium
spec:
  type: NodePort
  selector:
    app: helium-frontend
  ports:
    - port: 80
      targetPort: 80
      nodePort: 30000
---
apiVersion: v1
kind: Service
metadata:
  name: helium-backend
  namespace: helium
spec:
  type: NodePort
  selector:
    app: helium-backend
  ports:
    - port: 8000
      targetPort: 8000
      nodePort: 30001
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helium-backend-hpa
  namespace: helium
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helium-backend
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: helium-worker-hpa
  namespace: helium
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: helium-worker
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
3.4 Nginx Reverse Proxy Setup & Configuration
sudo apt update
sudo apt install -y nginx
sudo cp helium.conf /etc/nginx/sites-available/helium
sudo ln -s /etc/nginx/sites-available/helium /etc/nginx/sites-enabled/helium
sudo nginx -t
sudo systemctl restart nginx
helium.conf
server {
    listen 80;
    server_name _;

    # FRONTEND
    location / {
        proxy_pass http://127.0.0.1:30000;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }

    # BACKEND API
    location /api/ {
        proxy_pass http://127.0.0.1:30001;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
Traffic Flow
Browser → EC2 Public IP (port 80)
/ routed to frontend NodePort 30000
/api/ routed to backend NodePort 30001
3.5 Horizontal Pod Autoscaling
HPA definitions are included directly in helium-app.yaml
Autoscaling is applied automatically during application deployment
3.6 Metrics Server Installation
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
kubectl get pods -n kube-system | grep metrics-server
3.7 Scaling Validation
watch kubectl get pods -n helium
Traffic was generated against the application UI to validate backend and worker pod autoscaling behavior
3.8 How to Run & Verify (Quick Flow)
# 1. Build & push images (section 3.1): docker build / docker push
# 2. Install tools (section 3.2): ./install.sh
# 3. Create cluster (section 3.3): kind create cluster --config kind-cluster.yaml
# 4. Deploy application (section 3.3.2): kubectl apply -n helium -f helium-app.yaml
# 5. Configure Nginx (section 3.4): sudo nginx -t && sudo systemctl restart nginx
4. AWS Infrastructure Foundation – Singapore (Detailed, Step-by-Step)
The following section is included verbatim from the original implementation guide. No steps or technical details have been removed.
Phase 1: Infrastructure Foundation (Singapore)
Overview
This phase sets up the base network infrastructure in Singapore (ap-southeast-1) using the AWS Management Console.
Region: ap-southeast-1 (Singapore)
Estimated Time: 45–60 minutes
Prerequisites
AWS Account with administrative access
Logged into AWS Console: https://console.aws.amazon.com
Region switched to Singapore (ap-southeast-1)
Step 1: Virtual Private Cloud (VPC) Setup
1.1 Create VPC
Navigate to VPC service.
Click Create VPC.
VPC settings:
Resources to create: VPC only
Name tag: helium-singapore-vpc
IPv4 CIDR block: 10.0.0.0/16
Tenancy: Default
Click Create VPC.
1.2 Create Subnets
Click Subnets → Create subnet.
Select VPC ID: helium-singapore-vpc.
Public Subnets
helium-public-subnet-1 | AZ: ap-southeast-1a | CIDR: 10.0.1.0/24
helium-public-subnet-2 | AZ: ap-southeast-1b | CIDR: 10.0.2.0/24
Private Subnets
helium-private-subnet-1 | AZ: ap-southeast-1a | CIDR: 10.0.10.0/24
helium-private-subnet-2 | AZ: ap-southeast-1b | CIDR: 10.0.11.0/24
Click Create subnet.
1.3 Internet Gateway (IGW)
Navigate to Internet gateways → Create internet gateway.
Name tag: helium-igw.
Create and attach to VPC helium-singapore-vpc.
1.4 NAT Gateway
Navigate to NAT gateways → Create NAT gateway.
Name: helium-nat-gw.
Subnet: helium-public-subnet-1.
Connectivity type: Public.
Allocate Elastic IP.
Click Create NAT gateway.
1.5 Route Tables
Public Route Table
Name: helium-public-rt
Route: 0.0.0.0/0 → Internet Gateway
Associate public subnets
Private Route Table
Name: helium-private-rt
Route: 0.0.0.0/0 → NAT Gateway
Associate private subnets
Step 2: Security Groups
2.1 Load Balancer Security Group
Name: helium-alb-sg
Inbound rules:
HTTP (80) from 0.0.0.0/0
HTTPS (443) from 0.0.0.0/0
2.2 ECS Task Security Group
Name: helium-ecs-sg
Inbound rules:
TCP 8000 from helium-alb-sg
TCP 3000 from helium-alb-sg
Step 3: ECR Repositories
Navigate to Elastic Container Registry (ECR).
Create private repositories:
helium-backend
Step 4: SSL Certificate (ACM)
Navigate to Certificate Manager.
Request a public certificate.
Domain: he2.site
Additional name: *.he2.site
Validation method: DNS.
Create DNS records in Route 53.
Step 5: Verification Checklist
VPC helium-singapore-vpc exists
Public subnets route to IGW
Private subnets route to NAT Gateway
Security group rules validated
ECR repositories created
ACM certificate issued
Regional Reuse (Virginia & Mumbai)
The exact same steps above are repeated for:
Virginia (us-east-1)
Mumbai (ap-south-1)
Only the following change:
AWS Region
Availability Zones
Resource names (VPC, subnets, ALB, ECS cluster)
No architectural or procedural changes are required.
Phase 2: Backend Deployment (Singapore) – AWS Console Guide
Overview
This phase deploys the backend services to AWS ECS Fargate in Singapore (ap-southeast-1) using the AWS Management Console.
Region: ap-southeast-1
Estimated Time: 60–90 minutes
Prerequisites
Phase 1 completed
Docker installed locally (required for building images)
AWS CLI installed locally (required for authentication)
Step 1: Push Docker Images (Local Terminal)
Docker images must be built locally. This step cannot be performed from the AWS Console.
Open your local terminal (VS Code / Terminal).
Login to Amazon ECR (Singapore region):
aws ecr get-login-password --region ap-southeast-1 | docker login --username AWS --password-stdin <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com
Build the backend image (ARM64 architecture):
docker build --platform linux/arm64 -t helium-backend:latest ./backend
Tag the image:
docker tag helium-backend:latest <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com/helium-backend:latest
Push the image to ECR:
docker push <YOUR_AWS_ACCOUNT_ID>.dkr.ecr.ap-southeast-1.amazonaws.com/helium-backend:latest
Step 2: Secrets Management
Navigate to AWS Secrets Manager (Singapore region).
Click Store a new secret.
Secret type: Other type of secret.
Add the required key/value pairs.
Name the secret: helium/backend/production.
Store the secret and copy the Secret ARN.
Step 3: IAM Roles
3.1 Create ECS Task Execution Role
Trusted entity: Elastic Container Service Task
Attach policies:
AmazonECSTaskExecutionRolePolicy
AmazonEC2ContainerRegistryReadOnly
Role name: helium-ecs-execution-role-sg
3.2 Add Secrets Manager Permission
Attach an inline policy allowing secretsmanager:GetSecretValue for the backend secret ARN.
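A sketch of that inline policy is shown below (the account ID placeholder matches the one used elsewhere in this guide; Secrets Manager ARNs end with a random suffix, hence the trailing -*):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:ap-southeast-1:<YOUR_AWS_ACCOUNT_ID>:secret:helium/backend/production-*"
    }
  ]
}
```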
Step 4: ECS Cluster Creation
Cluster name: helium-singapore-cluster
Infrastructure type: Fargate (serverless)
Step 5: Task Definition
Task definition family: helium-backend-sg
Launch type: Fargate
OS / Architecture: Linux / ARM64
Task size:
CPU: 0.5 vCPU
Memory: 1 GB
Container configuration:
Name: backend
Image: Backend ECR image
Port: 8000
Secrets injected from AWS Secrets Manager
CloudWatch logging enabled
Step 6: Application Load Balancer (ALB)
Load balancer type: Application Load Balancer
Scheme: Internet-facing
Name: helium-sg-alb
Listener: HTTPS (443) with ACM certificate
Target group:
Type: IP
Port: 8000
Health check path:
/api/health
Step 7: Deploy ECS Service
Service name: helium-backend-service
Desired tasks: 2
VPC: helium-singapore-vpc
Subnets: Private subnets only
Public IP: Disabled
Load balancing configuration:
Load balancer: helium-sg-alb
Listener: HTTPS (443)
Target group: helium-backend-tg-sg
Container: backend:8000
Verification
ECS service reaches Steady state
Target group shows healthy targets
https://<ALB-DNS-NAME>/api/health returns HTTP 200
Optional Route 53 record: api-sg.he2.site
Phase 3: Data & Cache Layer (Singapore) – AWS Console Guide
Overview
This phase creates a Redis (ElastiCache) cluster in Singapore for low-latency caching and sets up an S3 bucket for object storage.
Region: ap-southeast-1
Estimated Time: 30–45 minutes
Step 1: Redis (ElastiCache)
Navigate to ElastiCache in the AWS Management Console.
1.1 Create Subnet Group
Click Subnet groups → Create subnet group.
Name: helium-redis-subnet-group-sg.
Description: Singapore Redis Subnets.
VPC: helium-singapore-vpc.
Add subnets: Select both private subnets.
Click Create.
1.2 Create Security Group (EC2)
Navigate to EC2 → Security Groups → Create security group.
Name: helium-redis-sg.
VPC: helium-singapore-vpc.
Inbound rule:
Type: Custom TCP
Port: 6379
Source: helium-ecs-sg
Click Create.
1.3 Create Redis Cluster
Navigate to ElastiCache → Redis caches → Create Redis cache.
Deployment option: Design your own cache.
Creation method: Cluster.
Name: helium-redis-sg.
Cluster mode: Disabled (enabled optional for scaling).
Node type: cache.t3.micro.
Number of replicas: 1.
Subnet group: helium-redis-subnet-group-sg.
Security group: helium-redis-sg.
Enable encryption at rest and in transit.
Click Create.
1.4 Update Secrets Manager
Wait until Redis status is Available.
Copy the primary endpoint.
Navigate to Secrets Manager → helium/backend/production.
Edit and update REDIS_HOST.
Save changes.
Force new deployment of backend ECS service.
Step 2: S3 Storage (Singapore)
Navigate to Amazon S3.
Click Create bucket.
Bucket name: helium-files-sg.
Region: ap-southeast-1.
Keep Block all public access enabled.
Enable bucket versioning.
Enable server-side encryption (SSE-S3).
Click Create bucket.
2.1 Configure CORS
Open the bucket → Permissions tab.
Edit CORS configuration and paste the JSON policy.
[
  {
    "AllowedHeaders": ["*"],
    "AllowedMethods": ["GET", "PUT", "POST", "DELETE", "HEAD"],
    "AllowedOrigins": ["https://he2.site", "https://www.he2.site"],
    "ExposeHeaders": ["ETag"],
    "MaxAgeSeconds": 3000
  }
]
Save changes.
Phase 4: Agent Execution (Singapore) – AWS Console Guide
Overview
This phase configures the background worker service in Singapore to process long-running agent and asynchronous tasks using ECS Fargate.
Region: ap-southeast-1
Estimated Time: 20–30 minutes
Prerequisites
Backend ECS service successfully deployed
Redis (ElastiCache) cluster in Available state
Step 1: Create Worker ECS Service
The worker service reuses the same backend Docker image but runs with a different container command.
Navigate to ECS → Clusters → helium-singapore-cluster.
Open the Services tab and click Create.
Compute options: FARGATE.
Task definition family: helium-backend-sg.
Service name: helium-worker-service.
Desired tasks: 2.
Networking configuration:
VPC: helium-singapore-vpc
Subnets: Private subnets
Security group: helium-ecs-sg
Public IP: Disabled
Container overrides:
Container: backend
Command: python, run_worker.py
Click Create.
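In task-definition JSON terms, that override corresponds to a container command of the following shape (an abbreviated sketch; the remaining container fields stay as in the backend task definition):

```json
{
  "containerDefinitions": [
    {
      "name": "backend",
      "command": ["python", "run_worker.py"]
    }
  ]
}
```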
Step 2: Service Auto Scaling (Optional)
Open helium-worker-service.
Update the service and enable auto scaling.
Minimum tasks: 1
Maximum tasks: 10
Scaling policy: Target tracking
Metric: ECSServiceAverageCPUUtilization
Target value: 70
Regional Reuse (Virginia & Mumbai)
The same Agent Execution (Phase 4) steps can be repeated for:
Virginia (us-east-1)
Mumbai (ap-south-1)
Only the following change:
AWS Region
ECS cluster name
Subnets and security groups (region-specific)
No architectural or procedural changes are required.
5. AWS Global Accelerator & Route 53 Integration
1. Purpose of This Document
This section describes the end-to-end setup of AWS Global Accelerator (GA) integrated with multi-region Application Load Balancers (ALBs) and Amazon Route 53. It covers architecture, configuration steps, routing behavior, traffic flow, and best practices from a Cloud Engineer perspective.
2. What is AWS Global Accelerator?
AWS Global Accelerator is a global networking service that improves application availability and performance by directing user traffic to the nearest healthy AWS Region using Anycast static IPs.
Key Characteristics
Global service (not region-specific)
Provides two static Anycast IP addresses
Routes traffic at AWS Edge locations
Supports multi-region endpoints (ALB, NLB, EC2, Elastic IP)
Offers near-instant regional failover
Important: Global Accelerator is managed from US West (Oregon). This is only the control plane location and does not mean that application traffic flows through Oregon.
3. Architecture Overview
User → Route 53 (DNS) → Global Accelerator (Anycast IP) → Nearest AWS Edge Location → Closest Healthy Region → Regional ALB → ECS Services
4. Why Use a Single Global Accelerator for Multiple Regions
Recommended Design
One Global Accelerator per application
Multiple endpoint groups (one per AWS Region)
Each endpoint group contains the regional ALB
Benefits
Automatic latency-based routing
Fast and seamless regional failover
Simple and clean DNS configuration
Lower operational overhead and cost
What NOT to Do
Do not create multiple Global Accelerators for the same application
Do not use Route 53 latency routing with Global Accelerator
5. Global Accelerator Configuration Steps
Step 1: Create the Global Accelerator
Create a new Global Accelerator
Listener protocol: TCP
Listener ports: 80 / 443 (based on ALB configuration)
Save the GA DNS name and static IP addresses
Step 2: Create Endpoint Groups (One Per Region)
Example regions:
ap-south-1 (Mumbai)
ap-southeast-1 (Singapore)
us-east-1 (Virginia)
For each endpoint group:
Set traffic dial to 100%
Use default health check settings
Step 3: Add ALBs to Endpoint Groups
For each region:
Endpoint type: Application Load Balancer
Select the ALB from the same region
Keep default weight unless traffic tuning is required
6. Security Group Considerations
Global Accelerator itself does not have a security group
Traffic reaches the ALB from AWS Edge locations
ALB Security Group must allow inbound HTTP/HTTPS traffic from either:
0.0.0.0/0 (simpler, less restrictive)
AWS-managed Global Accelerator prefix list (recommended)
7. Route 53 Configuration (DNS Setup)
Correct Record Setup
Record name: api
Record type: A (IPv4)
Alias: Yes
Route traffic to: Alias to Global Accelerator
Routing policy: Simple routing
Evaluate target health: Yes
Health check: Not required
Important Notes
Route 53 always shows US West (Oregon) for Global Accelerator
This is expected behavior
Do not select Latency, Geo, or Weighted routing
8. Why Simple Routing is Mandatory
| Layer | Responsibility |
|---|---|
| Route 53 | DNS resolution only |
| Global Accelerator | Latency routing & failover |
| ALB | Application load balancing |
9. Traffic Flow Explanation
User resolves api.domain.com
Route 53 returns the Global Accelerator IP
User connects to nearest AWS Edge location
Global Accelerator selects the closest healthy region
Traffic forwards to the regional ALB
ALB routes traffic to ECS service tasks
10. High Availability & Failover
Endpoint health is continuously monitored
If a region becomes unhealthy, traffic is instantly routed to the next healthy region
No DNS TTL or propagation delay
11. AWS Global Accelerator – Traffic Capacity
Global Accelerator
Designed to handle millions of requests per second
Built on the AWS global edge network using Anycast IPs
Scales automatically with traffic spikes
Application Load Balancer (per region – approximate soft limits)
~100,000+ requests per second
~3,000 new connections per second
~100,000 active connections
Actual capacity depends on instance size, target type, and request patterns.
12. Conclusion
This architecture provides:
Global performance optimization
High availability and fast failover
Simple and reliable DNS management
6. AWS X-Ray Integration (ECS – Infrastructure Level)
1. Purpose
This section documents the AWS X-Ray integration for ECS-based microservices from an infrastructure and Cloud Engineer perspective. Application-level instrumentation is intentionally excluded and handled by development teams.
2. What is AWS X-Ray
AWS X-Ray is a distributed tracing service used to analyze, debug, and monitor applications by tracking requests as they traverse AWS services.
Key Capabilities
End-to-end request tracing across services
Visual service maps showing dependencies
Latency and error analysis
Identification of faults, throttling, and failures
Native integration with Amazon CloudWatch
3. When to Use AWS X-Ray
AWS X-Ray is especially useful when:
Applications use a microservices architecture
Requests span multiple AWS services or regions
Latency issues require root-cause analysis
Precise failure points must be identified
4. High-Level Architecture Flow
Client Request → Application Load Balancer (ALB) → ECS Service (Application Container) → X-Ray Daemon (Sidecar Container) → AWS X-Ray Service → Service Maps & Traces
5. Cloud Engineer Responsibilities
Provision and validate X-Ray infrastructure
Add X-Ray daemon as a sidecar container
Configure required IAM permissions
Ensure compatibility with AWS-managed services
Integrate observability with CloudWatch
6. ECS X-Ray Setup (Infrastructure Side)
6.1 Prerequisites
ECS services running behind an Application Load Balancer
Tasks have outbound internet or NAT access
X-Ray supported in the selected AWS region
No account-level enablement is required.
6.2 X-Ray and Application Load Balancer
No manual "Enable X-Ray" option exists on ALB
ALB automatically injects the X-Amzn-Trace-Id header
ALB segments appear only after backend instrumentation
No ALB configuration changes are required.
6.3 ECS Task Definition – X-Ray Daemon (Sidecar)
For each ECS service (backend, worker):
Add X-Ray daemon as a sidecar container
Container image:
public.ecr.aws/xray/aws-xray-daemon:latest
Expose port 2000/UDP
Use the same network mode as the application container
Runs within the same ECS task
Responsibilities of the X-Ray Daemon
Receives trace data from application containers
Forwards trace data to AWS X-Ray
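As a sketch, the sidecar entry in the task definition's containerDefinitions could look like this (the CPU/memory values are illustrative assumptions, not values from the original setup):

```json
{
  "name": "helium-backend-xray",
  "image": "public.ecr.aws/xray/aws-xray-daemon:latest",
  "essential": false,
  "cpu": 32,
  "memoryReservation": 256,
  "portMappings": [
    { "containerPort": 2000, "protocol": "udp" }
  ]
}
```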
6.4 Environment Variables Configuration
Environment variables must be added to both the application container and the X-Ray daemon container.
Backend
X-Ray daemon container name: helium-backend-xray
Key: AWS_XRAY_DAEMON_ADDRESS
Value: helium-backend-xray:2000
Worker
X-Ray daemon container name: helium-worker-xray
Key: AWS_XRAY_DAEMON_ADDRESS
Value: helium-worker-xray:2000
Notes
Port 2000/UDP must be exposed
Container names act as DNS hostnames inside ECS tasks
6.5 IAM Configuration (Task Role)
Attach the managed policy below to the ECS Task Role:
AWSXrayWriteOnlyAccess
This allows:
Sending trace segments and subsegments
Communication with the X-Ray service
The task execution role does not require X-Ray permissions.
6.6 Deploy Updated ECS Services
Register the updated task definition
Deploy changes to backend and worker services
Confirm X-Ray daemon container is in RUNNING state
6.7 Validate X-Ray Daemon Logs
Expected log messages:
Successful initialization
Region detection confirmation
Get instance id metadata failed warnings (expected in ECS/Fargate)
6.8 Expected State Before Application Instrumentation
No service map visible
No traces in X-Ray console
This is expected until application code is instrumented and traffic flows.
7. Developer Responsibilities (Reference)
Add AWS X-Ray SDKs (Java, Node.js, Python, etc.)
Enable tracing in application code
Use meaningful service and subsegment names
8. Understanding Service Maps
What Service Maps Show
Connected services
Request flow paths
Error and fault locations
Latency between components
How Maps Are Generated
Applications send trace data
X-Ray builds maps automatically
Maps update dynamically with traffic
9. Viewing Traces and Errors
AWS Console → X-Ray → Traces
AWS Console → X-Ray → Service Map
You can analyze:
4xx / 5xx errors
Slow requests
Latency breakdown per service
Exact failure points
10. Common Issues and Observations
Daemon Running but No Traces
Application not instrumented
No incoming traffic
Wrong AWS region selected
IMDS Errors in Logs
Expected in ECS/Fargate
Safe to ignore
Do not impact trace collection
11. Integration with Amazon CloudWatch
X-Ray integrates natively with CloudWatch
Metrics can be used for alarms
Logs, metrics, and traces can be correlated
CloudWatch = metrics & logs
X-Ray = request-level visibility
12. Best Practices
Always run X-Ray daemon as a sidecar
Use clear and consistent service names
Combine X-Ray with CloudWatch alarms
Enable tracing early in lower environments
13. X-Ray Daemon Container Image
Recommended image:
public.ecr.aws/xray/aws-xray-daemon:latest
Official AWS-maintained image
Actively updated
No Docker Hub rate limits
Optimized for ECS, EKS, and Fargate
14. Conclusion
AWS X-Ray enables deep visibility into distributed systems. By preparing infrastructure in advance, Cloud Engineers allow development teams to activate tracing seamlessly when application changes are introduced.
7. Staging Deployment – ECS Fargate (Isolated Environment)
Overview
This section documents the staging environment deployment of the Helium application using AWS ECS Fargate. The staging setup mirrors production architecture while remaining fully isolated to safely validate changes before promotion.
The staging environment was created with separate clusters, services, task definitions, load balancer, SSL certificate, and Global Accelerator.
Staging Resources Created
ECS & Compute
ECS Cluster: helium-backend-staging-cluster
Backend ECS Service: helium-backend-staging-service
Worker ECS Service: helium-worker-staging-service
Backend Task Definition: helium-backend-staging
Worker Task Definition: helium-worker-staging
Networking & Security
Dedicated Application Load Balancer for staging
Separate Target Groups for backend staging service
Certificates & DNS
Separate ACM certificate created for staging domain
Certificate attached exclusively to the staging ALB
Global Traffic Management
Separate AWS Global Accelerator created for staging
Staging ALB registered as an endpoint in the staging Global Accelerator
Isolation maintained between production and staging traffic paths
Staging Architecture Characteristics
Complete isolation from production resources
Same ECS Fargate launch type
Independent deployment lifecycle
Safe environment for testing infrastructure, scaling, and application changes
Deployment Flow (Staging)
Build and push Docker images (same images reused or staging-tagged as required)
Register staging task definitions (helium-backend-staging, helium-worker-staging)
Deploy backend ECS service in helium-backend-staging-cluster
Deploy worker ECS service with background execution command
Attach staging services to the staging ALB and target groups
Validate health checks and service steady state
Register staging ALB with the staging Global Accelerator
Validate end-to-end traffic flow through Global Accelerator → ALB → ECS services
Purpose of Staging Environment
Validate ECS task definition changes
Test scaling behavior and resource limits
Verify Global Accelerator and ALB routing
Validate certificates and HTTPS termination
Perform safe functional and performance testing before production release
8. CI/CD – ECS Fargate Multi-Region Deployment
This section documents the CI/CD pipeline used to build, push, and deploy Helium backend and worker services across staging and production environments using GitHub Actions and Amazon ECS.
The pipeline supports:
Branch-based deployments
Multi-region Docker image builds
Safe staging deployments
Controlled production rollouts
CI/CD Workflow Definition
name: CICD
# =====================================================
# ๐ TRIGGERS
# =====================================================
# - beta → STAGING deploy
# - migrated-demo2 → PRODUCTION deploy
# =====================================================
on:
push:
branches:
- beta
- migrated-demo2
workflow_dispatch:
# =====================================================
# ๐ GLOBAL VARIABLES
# =====================================================
env:
IMAGE_TAG: v${{ github.run_number }}
ECR_REPO: helium-backend
# =====================================================
# ๐️ BUILD & PUSH DOCKER IMAGES
# =====================================================
# - Builds ARM64 image
# - Pushes to ECR in ALL required regions
# - Runs for BOTH staging and production branches
# =====================================================
jobs:
  build:
    name: Build & Push Docker Images
    runs-on: ubuntu-latest
    strategy:
      matrix:
        region:
          - ap-south-1      # Mumbai
          - ap-southeast-1  # Singapore
          - us-east-1       # Virginia
    # Each matrix leg sets only its own region's output key,
    # so all three image URIs are populated across the legs.
    outputs:
      image_ap: ${{ steps.export.outputs.image_ap }}
      image_sg: ${{ steps.export.outputs.image_sg }}
      image_us: ${{ steps.export.outputs.image_us }}
    steps:
      - name: Checkout Code
        uses: actions/checkout@v4

      - name: Setup Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ matrix.region }}

      - name: Login to Amazon ECR
        id: login
        uses: aws-actions/amazon-ecr-login@v2

      - name: Build & Push Image
        run: |
          IMAGE_URI=${{ steps.login.outputs.registry }}/${{ env.ECR_REPO }}:${{ env.IMAGE_TAG }}
          docker buildx build \
            --platform linux/arm64 \
            -t $IMAGE_URI \
            --push \
            -f backend/Dockerfile backend
          # Persist the URI so the next step in this leg can read it
          echo "IMAGE_URI=$IMAGE_URI" >> $GITHUB_ENV

      - name: Export Image Output
        id: export
        run: |
          if [ "${{ matrix.region }}" = "ap-south-1" ]; then
            echo "image_ap=$IMAGE_URI" >> $GITHUB_OUTPUT
          elif [ "${{ matrix.region }}" = "ap-southeast-1" ]; then
            echo "image_sg=$IMAGE_URI" >> $GITHUB_OUTPUT
          else
            echo "image_us=$IMAGE_URI" >> $GITHUB_OUTPUT
          fi
  # =====================================================
  # DEPLOY STAGING (Mumbai ONLY)
  # =====================================================
  deploy-staging:
    name: Deploy STAGING (Mumbai)
    runs-on: ubuntu-latest
    needs: build
    if: github.ref_name == 'beta'
    strategy:
      matrix:
        include:
          - task_type: backend
            container: helium-backend-staging
            td: helium-backend-staging
            svc: helium-backend-staging-service
            cluster: helium-mumbai-staging-cluster
          - task_type: worker
            container: helium-worker-staging
            td: helium-worker-staging
            svc: helium-worker-staging-service
            cluster: helium-mumbai-staging-cluster
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ap-south-1

      - name: Download Task Definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ matrix.td }} \
            --query taskDefinition \
            | jq 'del(.taskDefinitionArn,.revision,.status,.requiresAttributes,.compatibilities,.registeredAt,.registeredBy)' > task.json

      - name: Render Task Definition
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        id: render
        with:
          task-definition: task.json
          container-name: ${{ matrix.container }}
          image: ${{ needs.build.outputs.image_ap }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ matrix.svc }}
          cluster: ${{ matrix.cluster }}
          wait-for-service-stability: true
  # =====================================================
  # DEPLOY PRODUCTION (MULTI-REGION)
  # =====================================================
  deploy-production:
    name: Deploy PRODUCTION
    runs-on: ubuntu-latest
    needs: build
    if: github.ref_name == 'migrated-demo2'
    strategy:
      matrix:
        include:
          - region: ap-south-1
            image: ${{ needs.build.outputs.image_ap }}
            cluster: helium-mumbai-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service
          - region: ap-south-1
            image: ${{ needs.build.outputs.image_ap }}
            cluster: helium-mumbai-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service
          - region: ap-southeast-1
            image: ${{ needs.build.outputs.image_sg }}
            cluster: helium-singapore-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service
          - region: ap-southeast-1
            image: ${{ needs.build.outputs.image_sg }}
            cluster: helium-singapore-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service
          - region: us-east-1
            image: ${{ needs.build.outputs.image_us }}
            cluster: helium-virginia-cluster
            container: helium-backend
            td: helium-backend
            svc: helium-backend-service
          - region: us-east-1
            image: ${{ needs.build.outputs.image_us }}
            cluster: helium-virginia-cluster
            container: helium-worker
            td: helium-worker
            svc: helium-worker-service
    steps:
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v3
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ matrix.region }}

      - name: Download Task Definition
        run: |
          aws ecs describe-task-definition \
            --task-definition ${{ matrix.td }} \
            --query taskDefinition \
            | jq 'del(.taskDefinitionArn,.revision,.status,.requiresAttributes,.compatibilities,.registeredAt,.registeredBy)' > task.json

      - name: Render Task Definition
        uses: aws-actions/amazon-ecs-render-task-definition@v1
        id: render
        with:
          task-definition: task.json
          container-name: ${{ matrix.container }}
          image: ${{ matrix.image }}

      - name: Deploy to ECS
        uses: aws-actions/amazon-ecs-deploy-task-definition@v1
        with:
          task-definition: ${{ steps.render.outputs.task-definition }}
          service: ${{ matrix.svc }}
          cluster: ${{ matrix.cluster }}
          wait-for-service-stability: true
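The Download and Render steps above strip the read-only fields from a DescribeTaskDefinition response and swap in the freshly built image before re-registering the task definition. A minimal Python sketch of that logic (the sample task definition and resource names below are illustrative, not the real ones; the field names match the ECS API):

```python
import copy
import json

# Read-only fields returned by DescribeTaskDefinition that must be removed
# before the JSON can be re-registered (mirrors the pipeline's jq `del(...)`).
READ_ONLY_FIELDS = {
    "taskDefinitionArn", "revision", "status", "requiresAttributes",
    "compatibilities", "registeredAt", "registeredBy",
}

def render_task_definition(task_def, container_name, image):
    """Return a registerable copy of task_def with the named container's image swapped."""
    rendered = copy.deepcopy(
        {k: v for k, v in task_def.items() if k not in READ_ONLY_FIELDS}
    )
    for container in rendered["containerDefinitions"]:
        if container["name"] == container_name:
            container["image"] = image
    return rendered

# Hypothetical sample mimicking a DescribeTaskDefinition payload.
sample = {
    "taskDefinitionArn": "arn:aws:ecs:ap-south-1:123456789012:task-definition/helium-backend-staging:41",
    "revision": 41,
    "status": "ACTIVE",
    "family": "helium-backend-staging",
    "containerDefinitions": [
        {"name": "helium-backend-staging", "image": "old-image:v41"},
    ],
}

new = render_task_definition(sample, "helium-backend-staging", "new-image:v42")
print(json.dumps(new, indent=2))
```

In the real pipeline this work is done by `jq` plus `aws-actions/amazon-ecs-render-task-definition`; the sketch only makes the transformation explicit.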
9. Staging Deployment & CI/CD Pipeline
This section consolidates the Staging Deployment & CI/CD Pipeline documentation and aligns with the staging and CI/CD sections described above.
Scope Covered
Isolated staging environment using ECS Fargate
Separate staging ECS clusters, services, and task definitions
Dedicated staging Application Load Balancer and ACM certificate
Independent AWS Global Accelerator for staging traffic
GitHub Actions based CI/CD pipeline supporting:
Branch-based deployments (beta → staging, migrated-demo2 → production)
Multi-region Docker image builds
Safe, controlled rollouts to ECS services
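The branch gating above can be expressed as a simple lookup; a sketch mirroring the workflow's `if:` conditions (the function name is ours, not part of the pipeline):

```python
def target_environment(branch):
    """Map a pushed branch to its deploy target, mirroring the workflow's `if:` gates."""
    mapping = {
        "beta": "staging",               # deploy-staging job
        "migrated-demo2": "production",  # deploy-production job
    }
    return mapping.get(branch)  # any other branch triggers no deploy

print(target_environment("beta"))  # staging
```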
Key Staging Resources
ECS Cluster: helium-backend-staging-cluster
Backend Service: helium-backend-staging-service
Worker Service: helium-worker-staging-service
Task Definitions: helium-backend-staging, helium-worker-staging
Dedicated ALB, target groups, ACM certificate, and Global Accelerator
CI/CD Overview
Docker images are built once per commit (ARM64)
Images are pushed to ECR in Mumbai, Singapore, and Virginia
Staging deploys are automatically triggered from the beta branch
Production deploys are gated and triggered from migrated-demo2
ECS task definitions are rendered dynamically and deployed with service-stability checks
This completes the Day‑1 architecture documentation by extending it through staging isolation and automated delivery pipelines, ensuring a full lifecycle view from design to deployment.
10. Monitoring & Alerting (AWS Managed Grafana & CloudWatch)
This section documents the monitoring and alerting setup implemented using Amazon CloudWatch and AWS Managed Grafana to ensure visibility into system health and proactive notifications.
Overview
Monitoring was configured for ECS Fargate services (backend and worker) across environments to track CPU and memory utilization, visualize metrics, and trigger alerts based on defined thresholds.
CloudWatch Metrics & Alarms
The following CloudWatch alarms were created for all ECS services:
High CPU Utilization Alarm
Metric: CPUUtilization
Threshold: Greater than 70%
Evaluation: Sustained breach over configured periods
Action: Trigger alarm notification
High Memory Utilization Alarm
Metric: MemoryUtilization
Threshold: Greater than 70%
Evaluation: Sustained breach over configured periods
Action: Trigger alarm notification
Normal Utilization (Recovery) Alarm
Metric: CPU and Memory Utilization
Threshold: Below 60%
Purpose: Indicate service has returned to a normal operating range
Action: Send notification email indicating system stability
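As a sketch, the high-CPU alarm described above could be created through the CloudWatch PutMetricAlarm API; the keyword arguments below follow that API, while the alarm, cluster, service, and SNS topic names are hypothetical placeholders:

```python
# Alarm definition mirroring the thresholds above; resource names are
# illustrative, not the actual Helium resources.
high_cpu_alarm = {
    "AlarmName": "helium-backend-high-cpu",
    "Namespace": "AWS/ECS",
    "MetricName": "CPUUtilization",
    "Statistic": "Average",
    "Dimensions": [
        {"Name": "ClusterName", "Value": "helium-mumbai-cluster"},
        {"Name": "ServiceName", "Value": "helium-backend-service"},
    ],
    "Period": 300,            # 5-minute evaluation window
    "EvaluationPeriods": 2,   # sustained breach, not a single spike
    "Threshold": 70.0,
    "ComparisonOperator": "GreaterThanThreshold",
    # Hypothetical SNS topic ARN for email notifications
    "AlarmActions": ["arn:aws:sns:ap-south-1:123456789012:helium-alerts"],
}

# With AWS credentials configured, this would register the alarm:
# import boto3
# boto3.client("cloudwatch", region_name="ap-south-1").put_metric_alarm(**high_cpu_alarm)
```

The memory alarm is identical except for `MetricName: "MemoryUtilization"`, and the recovery alarm flips the comparison to `LessThanThreshold` at 60%.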
Alerting & Notifications
Amazon SNS was used as the notification mechanism for alarms
Email subscriptions were configured to receive:
Alerts when utilization crosses critical thresholds
Notifications when services return to normal utilization levels
AWS Managed Grafana Integration
AWS Managed Grafana workspace was created
CloudWatch was added as a data source
Dashboards were configured to visualize:
ECS service CPU utilization
ECS service memory utilization
Service-level performance trends
Grafana provides real-time and historical visibility into service health
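For illustration, a single CloudWatch query target behind the ECS CPU panel can be sketched as a dict roughly matching Grafana's CloudWatch data-source query model (field names are an approximation of that model, and the cluster/service values are placeholders):

```python
# Approximate shape of a Grafana CloudWatch panel query for ECS CPU;
# the exact JSON schema depends on the Grafana version in use.
cpu_panel_query = {
    "region": "ap-south-1",
    "namespace": "AWS/ECS",
    "metricName": "CPUUtilization",
    "statistic": "Average",
    "period": "300",
    "dimensions": {
        "ClusterName": "helium-mumbai-cluster",   # placeholder
        "ServiceName": "helium-backend-service",  # placeholder
    },
}

print(cpu_panel_query["namespace"], cpu_panel_query["metricName"])
```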
Purpose & Benefits
Early detection of performance bottlenecks
Clear visibility into ECS service behavior
Proactive alerting before user impact
Confirmation notifications when systems stabilize
This monitoring setup ensures the platform remains observable, reliable, and operationally ready across staging and production environments.