As a Solutions Architect who has reviewed hundreds of AWS deployments, from startups to enterprises, I’ve seen the same architectural mistakes repeated time and again. These aren’t just theoretical concerns; they’re real issues that have led to outages, security breaches, and unnecessarily high cloud bills. Let me walk you through the most common pitfalls I encounter and how to avoid them.
1. The Single Point of Failure: Ignoring Multi-AZ Deployments
One of the most fundamental yet frequently overlooked aspects of AWS architecture is proper multi-availability zone (multi-AZ) deployment. I’ve lost count of how many production systems I’ve seen running entirely in a single AZ.
The problem typically plays out like this: a startup launches their MVP in a single AZ in eu-west-1, everything works great, and they never review the architecture. Then one day, AWS has a routine maintenance event or an AZ experiences degradation, and their entire application goes dark. What makes this particularly frustrating is how easily avoidable it is.
The solution isn’t just about spreading resources across AZs; it’s about understanding what actually needs to be multi-AZ. Your application servers? Absolutely. Your RDS database? Use Multi-AZ deployments. Your load balancers? They should span at least two AZs. But here’s the catch: you don’t need every component to be multi-AZ. S3 automatically replicates across AZs within a region. Lambda functions run across multiple AZs by default.
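If you’re already running RDS in a single AZ, converting it is a small change. Here’s a minimal boto3 sketch; the instance identifier is a hypothetical placeholder, and deferring the change to the maintenance window avoids a surprise failover.

```python
import boto3

rds = boto3.client("rds")

# Convert an existing single-AZ RDS instance to Multi-AZ.
# "prod-db" is a hypothetical identifier; ApplyImmediately=False defers
# the change to the next maintenance window.
rds.modify_db_instance(
    DBInstanceIdentifier="prod-db",
    MultiAZ=True,
    ApplyImmediately=False,
)
```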
The real mistake I see is teams either going all-in on single-AZ deployment to save costs or making everything redundant without understanding the actual failure modes. The key is to map out your critical path and ensure there’s no single AZ failure that could take down your entire service.
2. Treating Security Groups as an Afterthought
Security groups are often treated as a necessary chore rather than a fundamental security boundary. The most common anti-pattern I encounter is the “wide open” security group: 0.0.0.0/0 on port 22, or worse, all ports open to the internet because “we’ll fix it later.”
Here’s a real example from a recent audit: a company had their production RDS instances accessible from 0.0.0.0/0 on port 3306. When I asked why, the explanation was that developers needed access from home, their parents’ houses, cafes, and so on. The correct solution wasn’t to open the database to the internet; it was to use a bastion host, AWS Systems Manager Session Manager, or a VPN.
The principle I advocate for is defense in depth combined with least privilege. Your security groups should be as restrictive as possible while still allowing legitimate traffic. Web servers should only accept traffic from your load balancer’s security group. Database servers should only accept connections from your application tier. No production resource should allow SSH or RDP from the internet.
A practice that has saved teams countless headaches is documenting why each security group rule exists. When you see a rule allowing port 8080 from a specific CIDR, you should be able to trace it back to a specific requirement. This makes it much easier to audit and clean up rules over time.
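To make that concrete, here’s a minimal boto3 sketch of the pattern: the application tier’s security group accepts traffic only from the load balancer’s security group, and the rule description records why it exists. The group IDs, port, and ticket reference are hypothetical placeholders.

```python
import boto3

ec2 = boto3.client("ec2")

# Allow the app tier to receive traffic only from the load balancer's
# security group -- no CIDR ranges, no 0.0.0.0/0. IDs are placeholders.
ec2.authorize_security_group_ingress(
    GroupId="sg-0aaa1111bbbb22223",  # app tier security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 8080,
        "ToPort": 8080,
        "UserIdGroupPairs": [{
            "GroupId": "sg-0ccc3333dddd44445",  # ALB security group
            "Description": "App traffic from the public ALB only (ticket ABC-123)",
        }],
    }],
)
```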
3. Misunderstanding VPC Design and Network Topology
VPC design is where I see some of the most creative misconfigurations. The classic mistake is treating AWS like a traditional data center and trying to force familiar network patterns into a cloud environment that works differently.
One common pattern is the “single public subnet” architecture: web servers, application servers, and databases all live in one public subnet with public IP addresses. This completely undermines the security benefits of a VPC and exposes your entire infrastructure to the internet.
The correct approach uses a multi-tier subnet strategy. Public subnets should only contain resources that genuinely need to be internet-accessible: load balancers, NAT gateways, and maybe bastion hosts. Your application tier goes in private subnets with internet access through NAT gateways. Your database tier lives in private subnets with no internet access at all.
Another mistake I frequently encounter is poor CIDR planning. Teams will create a VPC with a /24 network, launch a few resources, and then wonder why they can’t carve out more subnets, or find they can’t connect to their other VPCs because the ranges overlap. When designing VPCs, think about your growth trajectory. A /16 network gives you 65,536 IP addresses and plenty of room for subnet segmentation. Yes, most of those IPs will go unused, but that’s fine; private IP space is free.
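A quick way to sanity-check a CIDR plan before creating anything is Python’s ipaddress module. This sketch carves a /16 into /20 blocks and assigns one per tier per AZ; the tier names and AZs are just illustrative.

```python
import ipaddress

vpc = ipaddress.ip_network("10.0.0.0/16")

# A /16 splits into sixteen /20 blocks (4,096 addresses each) -- enough
# for three tiers across three AZs with spare blocks left for growth.
blocks = list(vpc.subnets(new_prefix=20))

tiers = ["public", "private-app", "private-db"]
azs = ["eu-west-1a", "eu-west-1b", "eu-west-1c"]

plan = {}
for i, (tier, az) in enumerate((t, a) for t in tiers for a in azs):
    plan[f"{tier}-{az}"] = blocks[i]

for name, cidr in plan.items():
    print(f"{name}: {cidr}")
```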
I also see teams struggle with VPC peering and Transit Gateway decisions. VPC peering works great for simple point-to-point connections, but if you’re connecting more than a handful of VPCs, you’ll end up with a mesh topology that’s difficult to manage. Transit Gateway provides a hub-and-spoke model that scales much better, though it comes with additional costs.
4. IAM Policies That Are Too Permissive
IAM is perhaps the most complex aspect of AWS, and the security implications of getting it wrong are severe. The most egregious mistake I see is the use of wildcard permissions and broad policies.
A pattern I encounter regularly: a developer needs to access S3, so someone grants them full S3 access (s3:* on *). This violates the principle of least privilege spectacularly. That developer can now delete any S3 bucket in your account, including your backup buckets and production data stores.
The correct approach is to grant the minimum permissions necessary for a specific task. If someone needs to read from a specific S3 bucket, grant them s3:GetObject on that bucket and its objects. If they need to write, add s3:PutObject. This granularity feels tedious at first, but it dramatically reduces your attack surface.
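As a sketch of what that granularity looks like in practice, here’s a hypothetical read-only policy scoped to a single bucket, created via boto3. The bucket name and policy name are placeholders.

```python
import json
import boto3

iam = boto3.client("iam")

# Read-only access to one specific bucket -- nothing else.
# "example-reports-bucket" is a hypothetical bucket name.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-reports-bucket/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": "arn:aws:s3:::example-reports-bucket",
        },
    ],
}

iam.create_policy(
    PolicyName="read-example-reports-bucket",
    PolicyDocument=json.dumps(policy_document),
)
```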
Another common mistake is creating IAM users instead of using IAM roles. I still see organizations creating IAM users with programmatic access keys for their applications running on EC2. These access keys often end up committed to Git repositories or shared in Slack channels. The solution is to use IAM roles for EC2 instances, which provides temporary credentials that rotate automatically.
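The mechanics are straightforward: attach an instance profile at launch, and code on the instance picks up temporary credentials automatically, with no keys to leak. A minimal sketch, with hypothetical AMI and profile names:

```python
import boto3

ec2 = boto3.client("ec2")

# Launch an instance with an IAM role attached via an instance profile.
# The AMI ID and profile name are placeholders.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    IamInstanceProfile={"Name": "app-server-instance-profile"},
)

# On the instance itself, no access keys are needed -- the SDK fetches
# temporary, auto-rotating credentials from the instance metadata service:
# s3 = boto3.client("s3")
```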
Service Control Policies (SCPs) are underutilized as well. If you’re using AWS Organizations, SCPs let you set guardrails that even account administrators can’t override. I recommend using SCPs to prevent things like disabling CloudTrail, modifying GuardDuty settings, or launching instances in unauthorized regions.
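As an illustration of that kind of guardrail, here’s a minimal SCP that denies stopping or deleting CloudTrail anywhere in the organization, attached via boto3. The policy name is a placeholder, and in practice you’d manage this through infrastructure as code.

```python
import json
import boto3

orgs = boto3.client("organizations")

# Guardrail: nobody in the organization, including account admins,
# can stop or delete CloudTrail logging.
scp = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
        "Resource": "*",
    }],
}

orgs.create_policy(
    Name="deny-cloudtrail-tampering",
    Description="Prevent disabling CloudTrail",
    Type="SERVICE_CONTROL_POLICY",
    Content=json.dumps(scp),
)
```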
5. Ignoring Cost Optimization
AWS bills can spiral out of control quickly if you’re not paying attention. The mistakes I see here aren’t usually about being unaware of costs (teams do see their bills) but about not understanding where those costs come from or how to optimise them.
The biggest culprit is running everything on-demand. I’ve reviewed accounts spending £50,000 per month on EC2 when they could cut that by 60% with Reserved Instances or Savings Plans for their baseline workload. This usually comes from fear of commitment.
Another common issue is orphaned resources. EBS volumes that are no longer attached to anything. Elastic IPs not associated with running instances (which AWS charges for). Load balancers for services that were decommissioned months ago. Old snapshots that no one remembers creating. I once found an account with over £5,000 per month in costs from NAT Gateway data processing charges because someone was routing all outbound traffic through a single NAT Gateway in one AZ, and most of that traffic was cross-region replication that could have gone through a VPC endpoint.
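Finding the low-hanging fruit is easy to script. This sketch lists unattached EBS volumes and unassociated Elastic IPs in the current region; it only reads, so it’s safe to run before deciding what to delete.

```python
import boto3

ec2 = boto3.client("ec2")

# EBS volumes in the "available" state are attached to nothing
# but are still billed every month.
for volume in ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]:
    print(f"Unattached volume: {volume['VolumeId']} ({volume['Size']} GiB)")

# Elastic IPs with no association are also charged.
for address in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in address:
        print(f"Unassociated Elastic IP: {address['PublicIp']}")
```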
Data transfer costs often surprise teams. Moving data between AZs costs money. Moving data out to the internet costs significantly more. Using VPC endpoints for S3 and DynamoDB can eliminate data transfer costs for accessing those services. Using CloudFront as a caching layer can dramatically reduce origin request costs.
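Adding a gateway endpoint for S3 is a small change with an outsized effect on NAT and data transfer charges. A minimal sketch, with hypothetical VPC and route table IDs:

```python
import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Route S3 traffic through a gateway endpoint instead of the NAT gateway.
# The VPC and route table IDs are placeholders.
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.eu-west-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```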
6. Neglecting Backup and Disaster Recovery
“We’ll implement backups once we’re in production” is a phrase that haunts me. By the time teams get to production, they’re focused on features and performance, and backups keep getting deprioritized until there’s a disaster.
The most common pattern I see is relying solely on EBS snapshots for backups without testing restoration. Taking snapshots is good, but if you’ve never actually restored from one, you don’t know if your backup strategy works. I recommend regular disaster recovery drills where you restore your entire environment from backups in a separate account or region.
RDS automated backups are enabled by default, which is great, but teams often don’t understand the retention period (default is 7 days) or that automated backups are deleted when you delete the database. For production databases, you should also take manual snapshots before major changes and consider using AWS Backup for centralized backup management across services.
Cross-region backups are often overlooked. If you’re only backing up within the same region, a region-wide outage could leave you without access to your backups when you need them most. S3 Cross-Region Replication and AWS Backup’s cross-region copy features make this straightforward.
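Here’s a minimal sketch of both habits together: take a manual snapshot before a major change, wait for it to complete, then copy it to a second region. The instance identifier, account ID, and regions are placeholders.

```python
import boto3

primary = boto3.client("rds", region_name="eu-west-1")
dr = boto3.client("rds", region_name="eu-west-2")

# Manual snapshot before a risky change (identifiers are placeholders).
primary.create_db_snapshot(
    DBInstanceIdentifier="prod-db",
    DBSnapshotIdentifier="prod-db-pre-migration",
)
primary.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier="prod-db-pre-migration"
)

# Copy the snapshot to a second region so a regional outage
# doesn't take the backups down with it.
dr.copy_db_snapshot(
    SourceDBSnapshotIdentifier=(
        "arn:aws:rds:eu-west-1:123456789012:snapshot:prod-db-pre-migration"
    ),
    TargetDBSnapshotIdentifier="prod-db-pre-migration-dr",
    SourceRegion="eu-west-1",
)
```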
7. Monitoring and Observability Gaps
CloudWatch is included with AWS, so most teams have it enabled, but they’re not using it effectively. The default metrics are useful but insufficient for understanding application health and performance.
The biggest mistake is not setting up proper alarms. I see production systems with no alarms at all, or alarms that send emails to a shared inbox that no one monitors. Effective monitoring requires actionable alarms that go to the right people through the right channels. If your database CPU hits 90%, someone should be paged. If your application logs show a spike in 500 errors, your on-call engineer should know immediately.
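That database example translates directly into a CloudWatch alarm. A minimal sketch, assuming a hypothetical instance identifier and an SNS topic wired to your paging tool:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Page someone when database CPU stays above 90% for 15 minutes.
# The instance identifier and SNS topic ARN are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="prod-db-high-cpu",
    Namespace="AWS/RDS",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "prod-db"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=90,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall-pager"],
)
```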
Custom metrics are underutilized. CloudWatch provides infrastructure metrics, but you should also be sending application-level metrics. How many orders per minute are you processing? What’s your API response time from the application’s perspective? How many items are in your SQS queue? These business and application metrics often provide earlier warning of problems than infrastructure metrics alone.
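Publishing those application-level numbers takes a single call. A sketch with a hypothetical namespace and metric name:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Emit a business metric alongside the infrastructure metrics.
# "Checkout" and "OrdersPerMinute" are hypothetical names.
cloudwatch.put_metric_data(
    Namespace="Checkout",
    MetricData=[{
        "MetricName": "OrdersPerMinute",
        "Value": 42,
        "Unit": "Count",
    }],
)
```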
Log aggregation and analysis is another gap. CloudWatch Logs captures your logs, but without proper log groups, retention policies, and analysis, those logs are just dark data. I recommend using CloudWatch Logs Insights for ad-hoc queries and CloudWatch Log Subscriptions to send logs to a centralized analysis tool if you’re operating at scale.
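For the ad-hoc side, a Logs Insights query can be run programmatically as well as from the console. This sketch counts 500 responses in five-minute buckets over the last hour; the log group name is a placeholder and the query assumes structured logs with a status field.

```python
import time
import boto3

logs = boto3.client("logs")

# Count 500 responses in 5-minute buckets over the last hour.
# "/app/production" is a hypothetical log group name.
query = logs.start_query(
    logGroupName="/app/production",
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString=(
        "fields @timestamp, @message "
        "| filter status = 500 "
        "| stats count() as errors by bin(5m)"
    ),
)

# Queries run asynchronously; poll until the status is "Complete".
results = logs.get_query_results(queryId=query["queryId"])
print(results["status"], results["results"])
```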
8. Poor Tagging Strategy (or No Strategy at All)
Tagging seems trivial until you have 500 resources across 10 accounts and can’t figure out which team owns what or how much each project is costing you. The most common mistake is inconsistent or absent tagging.
I’ve seen organizations with tagging policies that look great on paper but aren’t enforced. One team tags everything with Environment, Project, and Owner. Another team uses completely different tags. A third team doesn’t tag anything. When you go to generate a cost report by project, the data is useless.
The solution is to enforce tagging using Service Control Policies and AWS Tag Policies. You can require specific tags on resource creation and even validate tag values against a controlled vocabulary. At minimum, I recommend tagging all resources with: Environment (production, staging, development), Project or Application name, Owner (team or individual), and CostCenter (for chargeback).
Tag-based access control is powerful but underused. You can write IAM policies that only allow users to manage resources with specific tags. This lets you give developers broad permissions within their project boundary while preventing them from affecting other teams’ resources.
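A sketch of what that looks like as an IAM policy document: developers can start, stop, and reboot only the instances tagged with their project. The project value and account ID are placeholders.

```python
import json

# Allow instance operations only on resources tagged Project=checkout.
# The tag value and account ID are placeholders.
tag_scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "ec2:StartInstances",
            "ec2:StopInstances",
            "ec2:RebootInstances",
        ],
        "Resource": "arn:aws:ec2:*:123456789012:instance/*",
        "Condition": {
            "StringEquals": {"aws:ResourceTag/Project": "checkout"},
        },
    }],
}

print(json.dumps(tag_scoped_policy, indent=2))
```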
9. Poor Database Design Decisions
RDS makes it so easy to launch a database that teams often make architectural decisions without fully understanding the tradeoffs.
The most common mistake is using RDS when DynamoDB would be better (or vice versa). Teams default to RDS, but then they struggle with scaling, connection pooling, and read replica lag. Meanwhile, their access pattern is primarily key-value lookups that would work perfectly in DynamoDB at a fraction of the cost and operational complexity.
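For that kind of access pattern, the DynamoDB code is about as simple as it gets. A sketch with a hypothetical sessions table keyed on session_id:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-sessions")  # hypothetical table name

# Pure key-value access: write and read by primary key, no joins, no scans.
table.put_item(Item={
    "session_id": "abc123",
    "user_id": "u-42",
    "expires_at": 1767225600,
})

session = table.get_item(Key={"session_id": "abc123"}).get("Item")
print(session)
```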
On the flip side, I see teams forcing relational data into DynamoDB because they heard NoSQL is more scalable. They end up writing complex application code to maintain consistency across multiple tables and performing inefficient scans because they don’t have proper indexes.
Another issue is not right-sizing database instances. I frequently find production databases running on massively over-provisioned instances because someone launched a db.r5.4xlarge for testing and never revisited it. CloudWatch metrics can show you if your database is actually using all that capacity. Many workloads can run perfectly well on smaller instance types.
Read replica topology is often misunderstood as well. Adding read replicas doesn’t automatically improve performance; your application needs to be modified to route read queries to the replicas. And replica lag can cause consistency issues if not handled properly in your application logic.
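Because routing is an application concern, it helps to make the decision explicit in one place. A generic sketch with hypothetical endpoints: reads that can tolerate a little staleness go to a replica, everything else goes to the writer.

```python
import itertools

# Hypothetical endpoints for the primary and its read replicas.
WRITER_ENDPOINT = "prod-db.abc123.eu-west-1.rds.amazonaws.com"
READER_ENDPOINTS = itertools.cycle([
    "prod-db-replica-1.abc123.eu-west-1.rds.amazonaws.com",
    "prod-db-replica-2.abc123.eu-west-1.rds.amazonaws.com",
])


def endpoint_for(read_only: bool, lag_tolerant: bool) -> str:
    """Send reads to a replica only when slightly stale data is acceptable."""
    if read_only and lag_tolerant:
        return next(READER_ENDPOINTS)
    return WRITER_ENDPOINT


# A read that must reflect the latest write stays on the primary;
# a reporting query can go to a replica.
print(endpoint_for(read_only=True, lag_tolerant=False))  # writer
print(endpoint_for(read_only=True, lag_tolerant=True))   # replica
```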
10. Not Planning for Failure
The final and perhaps most critical mistake is building architectures that assume everything will work perfectly. AWS services are incredibly reliable, but failures happen. Region-wide outages occur. AZs go down. Individual instances fail.
The teams that handle failures well are those who have explicitly designed for them. They use Auto Scaling Groups so failed instances are automatically replaced. They use multiple AZs so an AZ failure doesn’t take down their service. They use Route 53 health checks and failover routing to redirect traffic away from unhealthy endpoints.
Circuit breakers and retry logic with exponential backoff should be standard in any distributed system. When a downstream dependency fails, your application should fail gracefully rather than cascading that failure to your users. Implementing timeouts, fallbacks, and bulkheads (isolating resources so that failure in one part doesn’t bring down the whole system) are essential patterns.
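A minimal retry helper with exponential backoff and full jitter shows the core of the pattern; in production you’d combine it with per-call timeouts and a circuit breaker rather than retrying blindly. This is a generic sketch, not tied to any particular library.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=0.2, max_delay=5.0):
    """Retry a flaky downstream call with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller's fallback handle it
            # Exponential backoff capped at max_delay, with full jitter
            # so many clients don't retry in lockstep.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```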
Moving Forward
These mistakes aren’t signs of incompetence; they’re usually the result of teams moving quickly and deferring architectural decisions. The key is to recognise these patterns early and address them before they become deeply embedded in your architecture.
My advice for avoiding these pitfalls: invest in proper architecture review early, implement infrastructure as code so your architecture is documented and reproducible, use AWS Well-Architected reviews to identify gaps, and most importantly, build a culture where operational excellence and security are first-class concerns, not afterthoughts.
The cloud gives you incredible power and flexibility, but with that comes responsibility. By learning from these common mistakes, you can build AWS architectures that are secure, reliable, cost-effective, and ready to scale with your business.