In today's cloud-first world, many organisations find themselves wrestling with a common challenge: monitoring fragmentation. If you're migrating to AWS from on-premises infrastructure, you've likely accumulated a collection of monitoring tools: Grafana here, Zabbix there, maybe some Prometheus, Scrutinizer, and a dash of CloudWatch. Each tool serves a purpose, but together they create operational chaos.
This article walks through a real-world architecture for consolidating multiple monitoring tools into a unified, AWS-native observability platform. Whether you're monitoring EKS clusters, Active Directory, firewalls, or a hybrid infrastructure, this guide will help you build a single pane of glass for your entire estate.
The Problem: Death by a Thousand Dashboards
Let’s paint a familiar picture:
- 3 AM: Your phone buzzes. Production is down.
- 3:02 AM: You check CloudWatch. Nothing obvious.
- 3:05 AM: Switch to Grafana. Some weird metrics.
- 3:10 AM: Fire up Zabbix. Server CPU is spiking.
- 3:15 AM: But why? Check logs in… wait, where are those logs again?
- 3:25 AM: Finally correlate the issue across four different systems.
- MTTR: 45 minutes (30 of which were spent context-switching between tools)
Sound familiar? You’re not alone.
The Core Requirements
When consolidating monitoring infrastructure, we need to solve for:
- Unified Visibility: One place to see everything
- Proactive Detection: Catch issues before users do
- Fast Root Cause Analysis: Correlate events across layers
- Compliance Ready: Query data for audits without panic
- Operational Efficiency: Stop paying for five tools when one will do
The Solution: AWS-Native Observability Stack
After extensive research and real-world implementation, here’s the architecture that actually works:
┌─────────────────────────────────────────────────────────┐
│ Visualization Layer │
│ CloudWatch Dashboards | Managed Grafana | QuickSight │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Analytics & Investigation Layer │
│ CloudWatch Insights | Athena | OpenSearch Service │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Centralized Data Lake (Optional) │
│ AWS Security Lake (OCSF) │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Monitoring & Security Services │
│ CloudWatch | Security Hub | GuardDuty | Config │
└─────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────┐
│ Your Infrastructure │
│ EKS | EC2 | Lambda | RDS | On-Prem (Logs Only) │
└─────────────────────────────────────────────────────────┘
The Core AWS Services
Let’s break down each component:
1. Amazon CloudWatch: Your Foundation
CloudWatch is unavoidable when working with AWS. Instead of fighting it, embrace it as your foundation.
What You Get:
- Metrics: CPU, memory, disk, network, custom application metrics
- Logs: Centralized log aggregation with retention policies
- Alarms: Threshold-based and anomaly detection alerting
- Dashboards: Pre-built and custom operational views
- Logs Insights: A purpose-built query language for fast log analysis
Real-World Setup:
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/application/*.log",
            "log_group_name": "/aws/application/myapp",
            "log_stream_name": "{instance_id}",
            "retention_in_days": 30
          }
        ]
      }
    }
  },
  "metrics": {
    "namespace": "CustomApp/Metrics",
    "metrics_collected": {
      "cpu": {
        "measurement": [
          {"name": "cpu_usage_idle", "unit": "Percent"}
        ]
      },
      "mem": {
        "measurement": [
          {"name": "mem_used_percent", "unit": "Percent"}
        ]
      }
    }
  }
}
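To make the "Alarms" bullet concrete, here's a minimal boto3 sketch of an anomaly-detection alarm on EC2 CPU; the alarm name and SNS topic ARN are placeholders you'd replace:
# anomaly_alarm.py - sketch of a CloudWatch anomaly-detection alarm
import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='ec2-cpu-anomaly',                     # placeholder name
    ComparisonOperator='GreaterThanUpperThreshold',  # fire when CPU leaves the expected band
    EvaluationPeriods=3,
    Metrics=[
        {
            'Id': 'cpu',
            'MetricStat': {
                'Metric': {'Namespace': 'AWS/EC2', 'MetricName': 'CPUUtilization'},
                'Period': 300,
                'Stat': 'Average',
            },
            'ReturnData': True,
        },
        {
            'Id': 'band',
            'Expression': 'ANOMALY_DETECTION_BAND(cpu, 2)',  # band of 2 standard deviations
        },
    ],
    ThresholdMetricId='band',
    AlarmActions=['arn:aws:sns:us-east-1:ACCOUNT:ops-alerts'],  # assumed SNS topic
)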
2. Container Insights for EKS
If you’re running Kubernetes on AWS, Container Insights is a game-changer.
Deployment:
# Enable control plane logging
aws eks update-cluster-config \
--name my-cluster \
--logging '{"clusterLogging":[{"types":["api","audit","authenticator"],"enabled":true}]}'
# Deploy FluentBit DaemonSet
kubectl apply -f https://raw.githubusercontent.com/aws-samples/amazon-cloudwatch-container-insights/latest/k8s-deployment-manifest-templates/deployment-mode/daemonset/container-insights-monitoring/fluent-bit/fluent-bit.yaml
What You See:
- Cluster-level metrics (CPU, memory, network)
- Namespace and pod-level breakdowns
- Node performance and capacity
- Application logs automatically collected from stdout/stderr
This replaces your standalone Prometheus + Grafana setup for most use cases.
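Because Container Insights publishes to the ContainerInsights CloudWatch namespace, you can alarm on cluster health like any other metric. A minimal sketch with boto3 (cluster name, threshold, and the SNS topic ARN are assumptions):
import boto3

cloudwatch = boto3.client('cloudwatch')

# Alarm when average node CPU across the cluster stays above 80% for 15 minutes
cloudwatch.put_metric_alarm(
    AlarmName='eks-node-cpu-high',   # placeholder name
    Namespace='ContainerInsights',
    MetricName='node_cpu_utilization',
    Dimensions=[{'Name': 'ClusterName', 'Value': 'my-cluster'}],
    Statistic='Average',
    Period=300,
    EvaluationPeriods=3,
    Threshold=80,
    ComparisonOperator='GreaterThanThreshold',
    AlarmActions=['arn:aws:sns:us-east-1:ACCOUNT:ops-alerts'],  # assumed SNS topic
)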
3. AWS Security Hub: Your Security Command Center
Think of Security Hub as your security findings aggregator. It’s like having a security operations assistant that never sleeps.
What It Aggregates:
- GuardDuty: AI-powered threat detection
- AWS Config: Configuration compliance
- IAM Access Analyzer: Permission issues
- Macie: Sensitive data discovery
- Inspector: Vulnerability scanning
Compliance Made Easy:
# Enable Security Hub with CIS AWS Foundations Benchmark
aws securityhub enable-security-hub \
--enable-default-standards
# Get compliance summary
aws securityhub get-findings \
--filters '{"ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}]}'
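The same filter works from boto3 if you want to pull failed findings into a report or a ticketing workflow. A rough sketch that summarises failures by resource type:
import boto3
from collections import Counter

securityhub = boto3.client('securityhub')

# Count failed compliance findings by resource type
counts = Counter()
paginator = securityhub.get_paginator('get_findings')
for page in paginator.paginate(
    Filters={'ComplianceStatus': [{'Value': 'FAILED', 'Comparison': 'EQUALS'}]}
):
    for finding in page['Findings']:
        for resource in finding.get('Resources', []):
            counts[resource.get('Type', 'Unknown')] += 1

for resource_type, count in counts.most_common(10):
    print(f'{resource_type}: {count} failed findings')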
4. Amazon OpenSearch: Your SIEM Replacement
Replacing Microsoft Sentinel? OpenSearch Service is your answer.
Why OpenSearch Over Sentinel?
If you're consolidating into AWS, OpenSearch Service keeps the SIEM next to the rest of your telemetry: it ingests from CloudWatch Logs, Kinesis Data Firehose, and Security Lake without shipping data to another cloud, and pricing follows the cluster you run rather than per-GB ingestion.
Pro tip: use OpenSearch's anomaly detection feature. It's surprisingly good at catching unusual patterns you'd miss manually.
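For example, detectors are defined through the anomaly detection plugin's REST API. A rough sketch using the opensearch-py client, where the domain endpoint, authentication, index pattern, and field names are all assumptions:
from opensearchpy import OpenSearch, RequestsHttpConnection

# Authentication (SigV4 or basic auth) omitted; it depends on your domain's access policy
client = OpenSearch(
    hosts=[{'host': 'your-domain.region.es.amazonaws.com', 'port': 443}],  # assumed endpoint
    use_ssl=True,
    connection_class=RequestsHttpConnection,
)

detector = {
    'name': 'failed-login-spike',
    'description': 'Flags unusual spikes in failed logins',
    'time_field': '@timestamp',
    'indices': ['auth-logs-*'],          # assumed index pattern
    'feature_attributes': [{
        'feature_name': 'failed_logins',
        'feature_enabled': True,
        'aggregation_query': {
            'failed_logins': {'value_count': {'field': 'event.id'}}  # assumed field
        },
    }],
    'detection_interval': {'period': {'interval': 10, 'unit': 'Minutes'}},
}

response = client.transport.perform_request(
    'POST', '/_plugins/_anomaly_detection/detectors', body=detector
)
print(response)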
5. AWS Security Lake: The Long-Term Play
Here’s where things get interesting. Security Lake is AWS’s answer to the question: “Where do I store petabytes of security data without going bankrupt?”
The OCSF Advantage
Security Lake automatically normalizes logs to the Open Cybersecurity Schema Framework (OCSF). This means:
- Standardized queries across all log sources
- Multi-cloud ready (Azure, GCP logs can be normalized too)
- Future-proof (vendor-agnostic format)
When to Use Security Lake:
✅ YES if you need:
- More than a year of log retention
- Compliance with strict audit requirements
- Multi-cloud strategy
- Cost-effective long-term storage (S3 is cheap!)
❌ NO if you need:
- Real-time alerting (use CloudWatch + OpenSearch instead)
- Simple single-account setup
- Quick implementation (<4 weeks)
Use Security Lake for retention, OpenSearch for hot analytics (last 30 days).
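Querying that data is then plain SQL in Athena against the OCSF tables Security Lake registers in Glue. A hedged sketch with boto3; the database, table name, and column paths below are assumptions that depend on your region and source version, and in practice you'd also filter on the table's partition columns to limit the scan:
import boto3

athena = boto3.client('athena')

# Example: console logins from the CloudTrail management events table
query = """
SELECT time, api.operation, actor.user.name
FROM amazon_security_lake_table_us_east_1_cloud_trail_mgmt_1_0  -- assumed table name; check your Glue catalog
WHERE api.operation = 'ConsoleLogin'
LIMIT 100
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'amazon_security_lake_glue_db_us_east_1'},  # assumed database
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'},             # assumed results bucket
)
print(execution['QueryExecutionId'])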
6. The On-Premises Challenge
Let’s address the elephant in the room: on-premises monitoring in a cloud-native world.
What’s Realistic:
✅ You CAN:
- Forward logs via CloudWatch Agent
- Send syslogs via Kinesis Firehose
- Store and search on-prem logs in AWS
- Create basic alerts on log patterns
❌ You CANNOT (easily):
- Get real-time metrics dashboards for on-prem hosts
- Automate remediation for on-prem resources
- Achieve full observability parity with AWS resources
The Pragmatic Approach:
# On-premises server → CloudWatch Logs
# Install agent
wget https://s3.amazonaws.com/amazoncloudwatch-agent/linux/amd64/latest/amazon-cloudwatch-agent.deb
sudo dpkg -i amazon-cloudwatch-agent.deb
# Configure to send logs only (no metrics)
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a fetch-config \
-m onPremise \
-s \
-c file:/opt/aws/amazon-cloudwatch-agent/etc/config.json
For true on-premises monitoring, you might need to keep Zabbix or Prometheus for a while.
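In the meantime, the "basic alerts on log patterns" item above is straightforward once those logs land in CloudWatch: attach a metric filter and alarm on it. A minimal sketch with boto3 (log group name, filter pattern, and topic ARN are assumptions):
import boto3

logs = boto3.client('logs')
cloudwatch = boto3.client('cloudwatch')

# Turn a log pattern into a metric...
logs.put_metric_filter(
    logGroupName='/onprem/syslog',              # assumed log group
    filterName='auth-failures',
    filterPattern='"authentication failure"',   # assumed pattern
    metricTransformations=[{
        'metricName': 'AuthFailures',
        'metricNamespace': 'OnPrem/Syslog',
        'metricValue': '1',
    }],
)

# ...then alarm on it like any other metric
cloudwatch.put_metric_alarm(
    AlarmName='onprem-auth-failures',
    Namespace='OnPrem/Syslog',
    MetricName='AuthFailures',
    Statistic='Sum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching',
    AlarmActions=['arn:aws:sns:us-east-1:ACCOUNT:ops-alerts'],  # assumed SNS topic
)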
Architecture Decision: Two Approaches
Approach 1: With Security Lake (Compliance-First)
Best For: Healthcare, finance, government, or anyone with >1 year log retention requirements
AWS Services → Security Lake (S3/OCSF) → Athena (SQL queries)
↓
OpenSearch (Last 30 days hot analytics)
↓
CloudWatch Dashboards + Managed Grafana
Pros:
- Cost-effective long-term retention
- OCSF standardization
- Multi-cloud ready
- Compliance-friendly
Cons:
- More complex setup
- Longer implementation (16-20 weeks)
- Requires OCSF knowledge
Approach 2: Direct CloudWatch/OpenSearch (Speed-First)
Best For: Startups, lighter compliance requirements, quick wins
AWS Services → CloudWatch Logs → OpenSearch (direct)
↓
CloudWatch Dashboards + Managed Grafana
↓
S3 (archived logs via export)
Pros:
- Faster implementation
- Simpler architecture
- Real-time everything
- Lower learning curve
Cons:
- Higher CloudWatch Logs costs at scale
- No OCSF normalization
- OpenSearch storage costs
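Before moving on: the "S3 (archived logs via export)" step in Approach 2 can be a simple scheduled export task rather than a separate pipeline. A minimal sketch with boto3 (log group and bucket names are assumptions; the bucket needs a policy that allows CloudWatch Logs to write to it, and only one export task runs at a time per account):
import time
import boto3

logs = boto3.client('logs')

now_ms = int(time.time() * 1000)
one_day_ms = 24 * 60 * 60 * 1000

# Export yesterday's logs from a hot log group into cheap S3 storage
logs.create_export_task(
    taskName='daily-archive',
    logGroupName='/aws/eks/my-cluster',    # assumed log group
    fromTime=now_ms - one_day_ms,
    to=now_ms,
    destination='my-log-archive-bucket',   # assumed bucket
    destinationPrefix='eks/my-cluster',
)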
Real-World Implementation: Step-by-Step
Let’s build this thing. Here’s the actual deployment sequence:
Week 1-2: Foundation
# 1. Enable AWS Organizations (if not already)
aws organizations create-organization
# 2. Enable CloudTrail (all regions, all accounts)
aws cloudtrail create-trail \
--name organization-trail \
--s3-bucket-name my-cloudtrail-bucket \
--is-organization-trail \
--is-multi-region-trail
# 3. Enable GuardDuty
aws guardduty create-detector --enable
# 4. Enable Security Hub
aws securityhub enable-security-hub
# 5. Enable AWS Config
aws configservice put-configuration-recorder \
--configuration-recorder name=default,roleARN=arn:aws:iam::ACCOUNT:role/aws-service-role/config.amazonaws.com/AWSServiceRoleForConfig
aws configservice start-configuration-recorder \
--configuration-recorder-name default
Week 3-4: EKS Monitoring
# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: amazon-cloudwatch
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush        5
        Log_Level    info

    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log
        Parser  docker
        Tag     kube.*

    [FILTER]
        Name       kubernetes
        Match      kube.*
        Kube_URL   https://kubernetes.default.svc:443
        Merge_Log  On

    [OUTPUT]
        Name               cloudwatch_logs
        Match              kube.*
        region             us-east-1
        log_group_name     /aws/eks/my-cluster
        log_stream_prefix  app-
        auto_create_group  true
kubectl apply -f fluent-bit-config.yaml
Week 5-6: OpenSearch SIEM
# Create OpenSearch domain
aws opensearch create-domain \
--domain-name security-analytics \
--engine-version "OpenSearch_2.11" \
--cluster-config InstanceType=r6g.large.search,InstanceCount=3 \
--ebs-options EBSEnabled=true,VolumeType=gp3,VolumeSize=100 \
--encryption-at-rest-options Enabled=true \
--node-to-node-encryption-options Enabled=true \
--advanced-security-options Enabled=true,InternalUserDatabaseEnabled=false
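Getting logs into the domain is the next step. One common pattern is a CloudWatch Logs subscription filter into a Kinesis Data Firehose delivery stream that has the OpenSearch domain as its destination. A sketch of the subscription side only (the Firehose stream, IAM role, and log group are assumptions and must already exist):
import boto3

logs = boto3.client('logs')

logs.put_subscription_filter(
    logGroupName='/aws/eks/my-cluster',   # assumed source log group
    filterName='to-opensearch',
    filterPattern='',                     # empty pattern = forward everything
    destinationArn='arn:aws:firehose:us-east-1:ACCOUNT:deliverystream/logs-to-opensearch',  # assumed stream
    roleArn='arn:aws:iam::ACCOUNT:role/cwlogs-to-firehose',                                 # assumed role
)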
Week 7-8: Dashboards and Alerts
// cloudwatch-dashboard.json
{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/EC2", "CPUUtilization", {"stat": "Average"}]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "EC2 CPU Overview"
      }
    },
    {
      "type": "log",
      "properties": {
        "query": "SOURCE '/aws/eks/my-cluster' | fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 20",
        "region": "us-east-1",
        "title": "Recent Errors"
      }
    }
  ]
}
aws cloudwatch put-dashboard \
--dashboard-name "Production-Overview" \
--dashboard-body file://cloudwatch-dashboard.json
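The log widget's query can also be run ad hoc from code during an investigation, which is handy for runbooks. A sketch with boto3, using the same log group as above:
import time
import boto3

logs = boto3.client('logs')

# Run the "recent errors" Logs Insights query over the last hour
query = logs.start_query(
    logGroupName='/aws/eks/my-cluster',
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString='fields @timestamp, @message | filter @message like /ERROR/ '
                '| sort @timestamp desc | limit 20',
)

# Poll until the query finishes, then print the results
while True:
    results = logs.get_query_results(queryId=query['queryId'])
    if results['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in results.get('results', []):
    print({field['field']: field['value'] for field in row})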
Automated Incident Response
Here’s where it gets interesting. Let’s automate security responses:
# lambda/security_response.py
import boto3

ec2 = boto3.client('ec2')
sns = boto3.client('sns')

def lambda_handler(event, context):
    """
    Responds to GuardDuty findings automatically.
    """
    finding = event['detail']
    finding_type = finding['type']

    # SSH brute force detected
    if 'SSHBruteForce' in finding_type:
        instance_id = finding['resource']['instanceDetails']['instanceId']

        # Quarantine the instance by swapping in a restrictive security group
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            Groups=['sg-quarantine']  # Pre-created quarantine security group (use its sg-xxxx ID)
        )

        # Notify the team
        sns.publish(
            TopicArn='arn:aws:sns:us-east-1:ACCOUNT:security-alerts',
            Subject=f'CRITICAL: Instance {instance_id} Quarantined',
            Message=f'Detected SSH brute force attack. Instance automatically isolated.\n\nFinding: {finding}'
        )

        return {'status': 'quarantined', 'instance': instance_id}
EventBridge Rule:
{
  "source": ["aws.guardduty"],
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [{"numeric": [">=", 7]}],
    "type": ["UnauthorizedAccess:EC2/SSHBruteForce"]
  }
}
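To wire this pattern to the Lambda function, the rule needs a target and the function needs permission to be invoked by EventBridge. A minimal sketch with boto3 (rule name, function name, and account ID are placeholders):
import json
import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

pattern = {
    'source': ['aws.guardduty'],
    'detail-type': ['GuardDuty Finding'],
    'detail': {
        'severity': [{'numeric': ['>=', 7]}],
        'type': ['UnauthorizedAccess:EC2/SSHBruteForce'],
    },
}

rule = events.put_rule(
    Name='guardduty-ssh-bruteforce',
    EventPattern=json.dumps(pattern),
    State='ENABLED',
)

events.put_targets(
    Rule='guardduty-ssh-bruteforce',
    Targets=[{
        'Id': 'security-response',
        'Arn': 'arn:aws:lambda:us-east-1:ACCOUNT:function:security-response',  # assumed function ARN
    }],
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='security-response',
    StatementId='allow-eventbridge',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'],
)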
Result: Threat detected → Instance isolated → Team notified. All in <30 seconds.
Cost Optimization Tips
Let’s talk money. Here’s how to keep costs reasonable:
1. CloudWatch Logs: The Cost That Sneaks Up on You
# Set appropriate retention periods
import boto3
logs = boto3.client('logs')
# Development logs: 7 days
logs.put_retention_policy(
logGroupName='/aws/lambda/dev-functions',
retentionInDays=7
)
# Production logs: 30 days
logs.put_retention_policy(
logGroupName='/aws/lambda/prod-functions',
retentionInDays=30
)
# Compliance logs: Export to S3, then delete
logs.put_retention_policy(
logGroupName='/aws/cloudtrail',
retentionInDays=90
)
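It's also worth sweeping for log groups still set to "Never expire"; they're the ones that quietly accumulate cost. A rough sketch that applies a 30-day default to any group without a retention policy (the default value, and which compliance groups to exclude, are your call):
import boto3

logs = boto3.client('logs')

DEFAULT_RETENTION_DAYS = 30  # assumed org-wide default; exclude compliance log groups as needed

paginator = logs.get_paginator('describe_log_groups')
for page in paginator.paginate():
    for group in page['logGroups']:
        # Log groups with no retention policy have no 'retentionInDays' key
        if 'retentionInDays' not in group:
            logs.put_retention_policy(
                logGroupName=group['logGroupName'],
                retentionInDays=DEFAULT_RETENTION_DAYS,
            )
            print(f"Set {DEFAULT_RETENTION_DAYS}-day retention on {group['logGroupName']}")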
2. Use Log Sampling
Not every log line needs immediate indexing:
# Sample 10% of high-volume logs
import random

def lambda_handler(event, context):
    if random.random() < 0.1:  # 10% sampling
        # Forward this record to OpenSearch for hot, searchable indexing
        pass
    # Always archive the full record to S3 (cheap storage), regardless of sampling
3. OpenSearch Reserved Instances
# Save 30-40% with 1-year reserved capacity
aws opensearch purchase-reserved-instance-offering \
--reserved-instance-offering-id offering-id \
--instance-count 3
4. S3 Intelligent-Tiering
# Automatic cost optimization for Security Lake
aws s3api put-bucket-intelligent-tiering-configuration \
--bucket security-lake-bucket \
--id intelligent-tiering \
--intelligent-tiering-configuration '{
"Id": "intelligent-tiering",
"Status": "Enabled",
"Tierings": [
{"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
{"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"}
]
}'
Migration Strategy: The Practical Path
Don’t try to do everything at once. Here’s the battle-tested sequence:
Phase 1: AWS Resources
- Start with EKS (highest ROI)
- Add EC2 instances
- Enable RDS Enhanced Monitoring
- Configure Lambda logging
Win: 60% of your monitoring consolidated
Phase 2: Security
- Enable Security Hub
- Deploy GuardDuty
- Set up OpenSearch SIEM
- Migrate from Sentinel
Win: Security team has single console
Phase 3: Dashboards
- Build CloudWatch operational dashboards
- Deploy Managed Grafana
- Recreate critical legacy dashboards
- Train operations team
Win: Ops team stops using old tools
Phase 4: On-Premises
- Deploy CloudWatch Agent to servers
- Configure syslog forwarding
- Archive on-prem logs in S3
Phase 5: Decommission
- Parallel run validation (2 weeks)
- Export historical data
- Turn off Zabbix, Prometheus
- Reclaim licenses and infrastructure
Common Issues (And How to Avoid Them)
Issue #1: CloudWatch Logs Cost Explosion
Problem: Someone enables debug logging in production
Solution:
# Filter at the source: only send WARNING and above to CloudWatch
import logging
import watchtower

# Note: newer watchtower releases name this parameter log_group_name
handler = watchtower.CloudWatchLogHandler(log_group='/aws/app')
handler.setLevel(logging.WARNING)

logger = logging.getLogger(__name__)
logger.addHandler(handler)
Issue #2: Alert Fatigue
Problem: 500 alerts per day, all marked “critical”
Solution:
# Implement alert prioritization
def calculate_severity(metric_value, threshold):
    if metric_value > threshold * 1.5:
        return 'CRITICAL'  # Page on-call
    elif metric_value > threshold * 1.2:
        return 'WARNING'   # Slack notification
    else:
        return 'INFO'      # Log only
Issue #3: The “We’ll Monitor Everything” Trap
Problem: Monitoring 10,000 metrics per instance
Solution: Start with the Golden Signals:
- Latency: How long requests take
- Traffic: Request volume
- Errors: Failure rate
- Saturation: Resource utilization
# Focused metric collection
CRITICAL_METRICS = [
'CPUUtilization',
'MemoryUtilization',
'NetworkIn',
'NetworkOut',
'DiskReadOps',
'DiskWriteOps'
]
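Latency and error rate usually aren't emitted for you, so publish them as custom metrics from the application. A minimal sketch (the namespace matches the agent config earlier; the dimension and values are illustrative):
import boto3

cloudwatch = boto3.client('cloudwatch')

def record_request(service: str, latency_ms: float, error: bool) -> None:
    """Publish golden-signal datapoints for one request."""
    cloudwatch.put_metric_data(
        Namespace='CustomApp/Metrics',
        MetricData=[
            {
                'MetricName': 'RequestLatency',
                'Dimensions': [{'Name': 'Service', 'Value': service}],
                'Unit': 'Milliseconds',
                'Value': latency_ms,
            },
            {
                'MetricName': 'Errors',
                'Dimensions': [{'Name': 'Service', 'Value': service}],
                'Unit': 'Count',
                'Value': 1 if error else 0,
            },
        ],
    )

# Example usage
record_request('checkout', latency_ms=182.0, error=False)
In production you'd batch datapoints or use the CloudWatch embedded metric format rather than calling the API once per request.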
Issue #4: Forgetting About Cardinality
Problem: OpenSearch cluster dies from high-cardinality fields
Solution:
# Don't index user IDs, session IDs, or timestamps as keywords!
PUT /logs/_mapping
{
  "properties": {
    "user_id": {
      "type": "text",    # don't use "keyword" for high-cardinality fields
      "index": false     # don't index fields you'll never search
    },
    "timestamp": {
      "type": "date"
    }
  }
}
Success Metrics: Measuring Your Win
Old Way: "Let me check 5 systems..."
Time to answer: 15-30 minutes
New Way: "Here's the CloudWatch dashboard..."
Time to answer: 30 seconds
Troubleshooting Guide
Issue: CloudWatch Agent Not Sending Logs
# Check agent status
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
-a query -m ec2 -c default
# Check agent logs
sudo tail -f /opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log
# Common fix: IAM permissions
# Ensure instance role has CloudWatchAgentServerPolicy
Issue: OpenSearch “Cluster Red” Status
# Check cluster health
curl -XGET 'https://your-domain.region.es.amazonaws.com/_cluster/health?pretty'
# Common causes:
# 1. Unassigned shards (need more nodes)
# 2. Disk space >85% used (scale storage)
# 3. JVM pressure (scale instance type)
# Quick fix: Delete old indices
curl -XDELETE 'https://your-domain.region.es.amazonaws.com/old-index-*'
Issue: High CloudWatch Costs
# Find expensive log groups (largest stored bytes first)
aws logs describe-log-groups \
--query 'logGroups[*].[logGroupName,storedBytes]' \
--output text | sort -k2 -rn | head -20
# Check for debug logs in production
aws logs filter-log-events \
--log-group-name /aws/lambda/my-function \
--filter-pattern "DEBUG" \
--limit 10
Best Practices Checklist
✅ Before You Start:
- [ ] Document current log volumes (GB/day)
- [ ] List all alert rules from legacy systems
- [ ] Identify compliance retention requirements
- [ ] Get buy-in from security and ops teams
- [ ] Set realistic budget expectations
✅ During Implementation:
- [ ] Start with non-production environment
- [ ] Run legacy and new systems in parallel (2+ weeks)
- [ ] Train ops team before cutover
- [ ] Have rollback plan ready
- [ ] Document everything (future you will thank you)
✅ After Go-Live:
- [ ] Monitor CloudWatch costs daily (first month)
- [ ] Review alert effectiveness weekly
- [ ] Gather user feedback from ops team
- [ ] Optimize based on actual usage patterns
- [ ] Schedule quarterly reviews
The Bottom Line
Consolidating from multiple monitoring tools to a unified AWS-native stack isn't just about reducing complexity; it's about operational excellence:
- Faster incident response: 15 minutes instead of 45
- Better security posture: Automated threat response
- Compliance confidence: Query any log in seconds
- Cost savings: £5-10k+/year in eliminated tools
- Happier ops team: One system to master, not five
Getting Started
If you’re ready to begin:
- Week 1: Audit current tools and costs
- Week 2: Estimate AWS costs with AWS Pricing Calculator
- Week 3: POC with non-prod EKS cluster
- Week 4: Build business case
- Week 5+: Execute phased migration
Conclusion
Building a unified AWS monitoring solution is a journey, not a destination. Start small, prove value quickly, and iterate based on real-world usage.
The goal isn't monitoring perfection; it's operational sanity. When your phone rings at 3 AM, you want answers in minutes, not a hunt across five different tools.
Tags: #aws #monitoring #observability #cloudwatch #devops #sre #kubernetes #eks #security #siem