Cloud Architecture Fundamentals
Regions, availability zones, auto-scaling, and load balancers – the building blocks every cloud architect uses. Here's how to design systems that never go down.
The restaurant that couldn't handle Saturday night
Maria opens a taco restaurant. It's tiny – one cook, one cash register, ten seats. Monday through Thursday, it's perfect. Then Friday hits. A line wraps around the block. The cook can't keep up. Customers wait 45 minutes, get angry, leave bad reviews, and never come back. Saturday is worse.
Maria has a capacity problem. She has two choices: hire more cooks and expand the kitchen permanently (expensive, and wasteful on slow Tuesdays), or find a way to bring in extra cooks only when the line gets long and send them home when it's quiet.
That second option? That's cloud architecture in a nutshell.
Designing systems in the cloud is exactly like designing a restaurant chain. You need to decide where to put your locations (regions), how to handle rush hour (auto-scaling), who greets customers at the door and seats them at the right table (load balancers), and what happens if the kitchen catches fire (disaster recovery). Every decision you make about your cloud infrastructure maps to a real-world problem that restaurant owners, city planners, and logistics managers have been solving for centuries.
This module gives you the vocabulary and mental models to think like a cloud architect – even if you never write a line of infrastructure code.
The Well-Architected Framework: five pillars
AWS, Azure, and Google Cloud all publish frameworks for building systems that don't fall over. AWS calls theirs the Well-Architected Framework, and it organises everything into five pillars. Think of these as the five non-negotiable qualities of any system worth building:
| Pillar | What it means | Restaurant analogy |
|---|---|---|
| Operational excellence | Can you run the system smoothly day after day? Monitoring, automation, continuous improvement | The restaurant has checklists, opening/closing procedures, and a manager who reviews what went wrong each night |
| Security | Is the system protected from threats? Access control, encryption, auditing | Only staff can enter the kitchen. The safe has a combination. Cameras record the register |
| Reliability | Does the system keep working when things break? Redundancy, failover, recovery | If the main oven breaks, there's a backup. If the power goes out, the generator kicks in |
| Performance efficiency | Are you using the right resources for the job? Right-sizing, caching, choosing the right tech | You don't hire a head chef to wash dishes, and you don't seat two people at a table for twelve |
| Cost optimisation | Are you spending wisely? Eliminating waste, using the right pricing model | You don't leave the lights on all night. You buy ingredients in bulk when they're cheaper |
Every architectural decision you make should be evaluated against these five pillars. A system that's blazing fast but costs ten times more than it should fails the cost pillar. A system that's cheap but goes down every weekend fails the reliability pillar.
High availability vs. fault tolerance vs. disaster recovery
These three terms get used interchangeably, but they mean different things. Here's how to keep them straight:
High availability (HA) means the system stays up almost all the time. It's measured in "nines" – 99.9% uptime ("three nines") means about 8.7 hours of downtime per year. 99.99% ("four nines") means about 52 minutes per year. The goal is to minimise downtime, but you accept that brief interruptions might happen.
Restaurant analogy: The restaurant is open 7 days a week, 16 hours a day. Customers can almost always walk in and get a table.
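The "nines" arithmetic is easy to check yourself. A short sketch (the helper function is our own, not a library call) – allowed downtime is just the unavailable fraction of the minutes in a year:

```python
# Quick check of the "nines" figures. downtime_per_year is our own helper.
def downtime_per_year(availability_pct: float) -> float:
    """Allowed downtime in minutes per year at a given availability."""
    minutes_per_year = 365 * 24 * 60  # 525,600
    return minutes_per_year * (1 - availability_pct / 100)

print(round(downtime_per_year(99.9)))   # three nines: ~526 minutes (~8.7 hours)
print(round(downtime_per_year(99.99)))  # four nines: ~53 minutes
```

Each extra nine cuts the allowed downtime by a factor of ten, which is why each extra nine also roughly multiplies the engineering cost.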
Fault tolerance means the system keeps working even while something is actively broken. No interruption at all – the failure is invisible to the user.
Restaurant analogy: There are two identical kitchens. If one catches fire, the other keeps producing food without a single order being missed. Customers never know anything went wrong.
Disaster recovery (DR) means you have a plan to get the system back up after a major failure. There will be downtime, but you know exactly how to recover and how long it will take.
Restaurant analogy: The restaurant burns down. But you have insurance, a backup location, recipes stored off-site, and a plan to reopen within two weeks.
| Concept | Downtime? | Cost | Use when... |
|---|---|---|---|
| High availability | Minimal (seconds to minutes) | Medium | Most production applications |
| Fault tolerance | Zero | High | Critical systems – banking, healthcare, aviation |
| Disaster recovery | Hours to days | Lower (planning + backup costs) | Every system needs a DR plan, even if it's simple |
There Are No Dumb Questions
"Do I need all three?"
In practice, yes – but at different levels. Every system should have a disaster recovery plan. Most production systems should be highly available. Only mission-critical systems (think stock exchanges, air traffic control) need full fault tolerance, because it's expensive to eliminate every possible point of failure.
"What's the difference between RTO and RPO?"
RTO (Recovery Time Objective) is how fast you need to be back online after a failure. RPO (Recovery Point Objective) is how much data you can afford to lose. If your RPO is 1 hour, you need backups at least every hour. If your RTO is 15 minutes, your recovery process must get you back online within 15 minutes. These two numbers drive every DR decision.
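The relationship can be sketched in a few lines (the function and parameter names here are illustrative, not any provider's API): worst-case data loss is bounded by the gap between backups, and worst-case outage by how long recovery takes.

```python
# Illustrative only: function and argument names are ours, not a cloud API.
def meets_objectives(backup_interval_min, recovery_duration_min,
                     rpo_min, rto_min):
    """Worst-case data loss is the gap between backups (checked against RPO);
    worst-case outage is how long recovery takes (checked against RTO)."""
    return {
        "rpo_ok": backup_interval_min <= rpo_min,
        "rto_ok": recovery_duration_min <= rto_min,
    }

# Hourly backups and a 10-minute restore, against RPO = 60 min, RTO = 15 min:
print(meets_objectives(60, 10, rpo_min=60, rto_min=15))
# -> {'rpo_ok': True, 'rto_ok': True}
```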
Regions and availability zones: cities and buildings
Cloud providers organise their infrastructure geographically. Understanding this hierarchy is fundamental:
Regions are independent geographic areas – think of them as cities. AWS has regions like us-east-1 (Northern Virginia), eu-west-1 (Ireland), and ap-southeast-1 (Singapore). Each region is completely independent. A disaster in one region doesn't affect another.
Availability Zones (AZs) are isolated data centres within a region – think of them as buildings within a city. Each AZ has its own power, cooling, and networking. If one building floods, the others keep running. A typical region has 3 AZs.
Edge locations are smaller caches spread around the world – think of them as pop-up kiosks in airports and malls. They serve content closer to users for faster delivery (this is what CDNs like CloudFront, Akamai, and Cloudflare use).
| Level | Analogy | Purpose | Example |
|---|---|---|---|
| Region | City | Geographic isolation, data residency, latency | us-east-1 (Virginia) |
| Availability Zone | Building in a city | Fault isolation within a region | us-east-1a, us-east-1b |
| Edge location | Pop-up kiosk | Cache content close to users | 400+ locations globally (AWS CloudFront) |
When you deploy an application, you choose a region based on three factors: where your users are (lower latency), where your data is legally allowed to live (compliance), and which services are available in that region (not every service is available everywhere).
For high availability, you deploy across multiple AZs within a region. For disaster recovery, you replicate to a different region entirely.
Auto-scaling: the thermostat
Your home thermostat monitors the temperature. When it gets too cold, the heater turns on. When it's warm enough, the heater turns off. You don't manually flip a switch every time the temperature changes โ the system reacts automatically.
Auto-scaling works exactly the same way. You set rules: "If CPU usage exceeds 70% for 5 minutes, add two more servers. If it drops below 30% for 10 minutes, remove one server." The cloud watches your metrics and adjusts capacity automatically.
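The rule above can be sketched as a thermostat-style decision function. The thresholds and window sizes mirror the example in the text; the function name and sample data are our own invention, not any cloud provider's API:

```python
def scaling_decision(cpu_samples, threshold_high=70, threshold_low=30,
                     high_window=5, low_window=10):
    """cpu_samples: recent per-minute CPU percentages, newest last.
    Sustained high load adds two servers; sustained low load removes one."""
    if len(cpu_samples) >= high_window and \
            all(s > threshold_high for s in cpu_samples[-high_window:]):
        return +2   # scale out: CPU above 70% for 5 straight minutes
    if len(cpu_samples) >= low_window and \
            all(s < threshold_low for s in cpu_samples[-low_window:]):
        return -1   # scale in: CPU below 30% for 10 straight minutes
    return 0        # hold steady

print(scaling_decision([65, 72, 80, 85, 90, 88]))  # -> 2
```

Real auto-scalers add cooldown periods and minimum/maximum fleet sizes on top of this basic loop, so a traffic spike doesn't cause the fleet to oscillate.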
There are two types:
Vertical scaling (scale up/down) – make a single server bigger or smaller. Give it more CPU, more RAM, more storage. This is like replacing your restaurant's small oven with a bigger one. It has limits – you can only make one machine so large.
Horizontal scaling (scale out/in) – add or remove servers. Instead of one big oven, you install five normal ovens. This is how most cloud systems scale, because there's no upper limit – you can always add more servers.
| Type | How it works | Pros | Cons |
|---|---|---|---|
| Vertical | Bigger machine | Simple, no code changes | Has a ceiling, requires downtime to resize |
| Horizontal | More machines | No ceiling, no downtime | App must be designed for it (stateless) |
Auto-scaling is one of the cloud's superpowers. In the on-premises world, you had to buy enough servers for your peak traffic and let them sit idle the rest of the time. With auto-scaling, you pay for peak capacity only when you actually hit peak traffic.
Load balancing: the restaurant host
When you walk into a busy restaurant, the host doesn't send every customer to the same table. They look at which tables are available, which sections are busy, and seat you at the best option. If one section of the restaurant is closed for cleaning, the host routes everyone to the open sections.
A load balancer does exactly this for your servers. It sits in front of your application and distributes incoming requests across multiple servers. If one server goes down, the load balancer stops sending traffic to it. If a new server spins up (thanks to auto-scaling), the load balancer starts including it.
Common load balancing strategies:
- Round robin – requests go to each server in turn: Server 1, Server 2, Server 3, Server 1, Server 2... Simple, but blind to how loaded each server actually is.
- Least connections – send the request to whichever server is handling the fewest active connections. Smarter for uneven workloads.
- IP hash – the same user always goes to the same server. Useful when you need "sticky sessions."
- Weighted – some servers get more traffic than others (e.g., a powerful server gets 60% of traffic, a smaller one gets 40%).
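Three of these strategies fit in a few lines of Python (the server names and connection counts are made up for illustration):

```python
import itertools

servers = ["server-1", "server-2", "server-3"]

# Round robin: each request goes to the next server in turn.
rr = itertools.cycle(servers)
print([next(rr) for _ in range(4)])
# -> ['server-1', 'server-2', 'server-3', 'server-1']

# Least connections: route to the server with the fewest active connections.
active = {"server-1": 12, "server-2": 3, "server-3": 7}
print(min(active, key=active.get))  # -> server-2

# IP hash: the same client always lands on the same server ("sticky sessions").
# (Python salts str hashes per process; real balancers use a stable hash.)
def pick_by_ip(ip: str) -> str:
    return servers[hash(ip) % len(servers)]
```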
There Are No Dumb Questions
"What happens if the load balancer itself goes down?"
Cloud-managed load balancers (like AWS ALB, Azure Load Balancer, or Google Cloud Load Balancing) are themselves highly available. They run across multiple AZs and are designed to never be a single point of failure. You don't manage the load balancer's infrastructure – the cloud provider does.
"Do I always need a load balancer?"
If you have more than one server, yes. Even if you have just one server, a load balancer adds health checking – it can detect when your server is unhealthy and stop routing traffic to it while you fix the problem.
Microservices vs. monoliths
When you build an application, you have two fundamental architecture choices:
Monolith – everything lives in one big codebase, deployed as one unit. The user interface, business logic, database access, payment processing, email sending – all bundled together.
Restaurant analogy: One kitchen where a single chef handles everything – appetisers, mains, desserts, drinks. If the chef gets sick, the entire restaurant shuts down.
Microservices – the application is split into small, independent services. Each service does one thing, has its own database, and communicates with others via APIs.
Restaurant analogy: Separate stations – one for appetisers, one for grills, one for desserts, one for drinks. Each station has its own chef. If the dessert station breaks down, you can still serve mains.
| Factor | Monolith | Microservices |
|---|---|---|
| Complexity | Simple to build and deploy initially | Complex – many moving parts |
| Scaling | Scale the whole thing, even if only one part needs it | Scale individual services independently |
| Failure | One bug can crash the entire app | Failures are isolated to individual services |
| Team size | Works well for small teams | Enables large teams to work independently |
| Best for | Early-stage startups, simple apps | Large-scale systems, big organisations |
Most companies start with a monolith and evolve toward microservices as they grow. Don't start with microservices unless you have a large team and a good reason – the operational overhead is significant.
Serverless: no servers to manage (there are still servers)
"Serverless" is one of the most confusingly named concepts in tech. There are definitely servers – you just don't manage, provision, or even think about them. You write a function, upload it, and the cloud runs it whenever it's triggered.
Restaurant analogy: Instead of renting a kitchen and hiring staff, you use a ghost kitchen service. You submit your recipe, and the ghost kitchen makes the dish whenever an order comes in. You pay per dish, not per month. No orders? You pay nothing.
AWS Lambda, Azure Functions, and Google Cloud Functions are the big three serverless platforms. You write a function (a small piece of code), define a trigger (an HTTP request, a file upload, a database change), and the platform handles everything else – scaling, availability, patching, monitoring.
When serverless makes sense:
- Event-driven workloads – processing an image after upload, sending an email after a purchase
- Unpredictable traffic – a function that gets called 10 times Monday and 10,000 times Friday
- Background jobs – data processing, report generation, cleanup tasks
When serverless doesn't make sense:
- Long-running processes – most serverless platforms have execution time limits (15 minutes on Lambda)
- Consistent high traffic – if your function runs 24/7 at full capacity, a traditional server may be cheaper
- Complex stateful applications – serverless functions are stateless by design
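To make the shape concrete, here's a minimal handler sketch in the AWS Lambda Python style. `handler(event, context)` is Lambda's convention, but the event fields below are invented for illustration – this is not a real S3 or API Gateway payload:

```python
# Minimal serverless-style handler sketch. The event fields are invented
# for this example, not a real cloud event payload.
def handler(event, context=None):
    """Runs once per trigger. Stateless by design: everything the function
    needs arrives in the event, and nothing persists between invocations."""
    filename = event["filename"]
    sizes = event.get("thumbnail_sizes", [128, 256])
    # In a real function, the resize work would happen here.
    return {
        "source": filename,
        "thumbnails": [f"{filename}.thumb{s}.jpg" for s in sizes],
    }

print(handler({"filename": "taco.jpg"}))
```

The platform invokes this once per event, runs as many copies in parallel as the traffic demands, and bills only for the milliseconds actually used.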
Containers: packing your app in a box
A container packages your application along with everything it needs to run – code, libraries, system tools, settings. It's like packing a lunchbox: everything needed for the meal is inside, and it works no matter what table you eat at.
Before containers, the most dreaded phrase in software was "it works on my machine." An app would run perfectly on a developer's laptop but crash on the production server because of different software versions, missing libraries, or conflicting configurations. Containers solve this by guaranteeing that the app runs the same way everywhere.
Docker is the most popular container technology. You write a Dockerfile that describes what goes in the container, build it into an image, and run that image as a container on any machine with Docker installed.
But running one container is easy. Running hundreds or thousands of containers – starting them, stopping them, restarting crashed ones, distributing them across servers – requires an orchestrator:
| Orchestrator | Who runs it | Key feature |
|---|---|---|
| Kubernetes (K8s) | Open source (Google origin) | The industry standard. Runs anywhere – AWS, Azure, GCP, or your own servers |
| Amazon ECS | AWS | Tighter AWS integration, simpler than Kubernetes |
| Azure Kubernetes Service (AKS) | Azure | Managed Kubernetes on Azure |
| Google Kubernetes Engine (GKE) | Google Cloud | Managed Kubernetes on GCP (built by the team that created K8s) |
Restaurant analogy: Docker is the standardised lunchbox. Kubernetes is the catering company that manages hundreds of lunchboxes – making sure every table gets the right meal, replacing any lunchbox that got dropped, and adding more lunchboxes when a big event is coming.
Infrastructure as code: blueprints, not handwork
In the early days, setting up a server meant clicking through a web console – manually creating a database here, a network there, a firewall rule somewhere else. This worked until it didn't. When you needed to recreate the same setup in another region, or figure out what changed when something broke, you were hunting through console logs and hoping someone wrote it down.
Infrastructure as Code (IaC) means defining your entire infrastructure in code: plain text files that describe every server, database, network, and permission. You check these files into version control (like Git), review them in pull requests, and apply them with a single command.
Restaurant analogy: Instead of telling a contractor "build me a kitchen" and hoping they remember what you said, you give them architectural blueprints. Every detail is documented. If you want to build an identical kitchen in another city, you hand over the same blueprints.
The two dominant IaC tools:
| Tool | Created by | Works with | Language |
|---|---|---|---|
| Terraform | HashiCorp | Any cloud provider (AWS, Azure, GCP, Cloudflare, and hundreds more) | HCL (HashiCorp Configuration Language) |
| CloudFormation | AWS | AWS only | JSON or YAML |
Other notable tools include Pulumi (IaC using real programming languages like Python and TypeScript), AWS CDK (write CloudFormation in TypeScript/Python), and Bicep (Azure-specific, simpler than ARM templates).
Why IaC matters:
- Reproducibility – spin up identical environments for development, staging, and production
- Version control – track every change to your infrastructure the same way you track code changes
- Automation – deploy infrastructure changes through CI/CD pipelines, not manual clicks
- Documentation – the code is the documentation. You can always see exactly what's deployed
- Disaster recovery – if a region goes down, redeploy the entire infrastructure from code in a new region
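The core idea behind tools like Terraform is declarative: you state the desired end state, and the tool diffs it against what actually exists and computes the changes. This toy sketch illustrates the idea only – it is nothing like a real tool's implementation, and the resource names are made up:

```python
# Toy illustration of a declarative diff, loosely in the spirit of
# `terraform plan`. Resource names and configs are invented.
def plan(current: dict, desired: dict) -> dict:
    """Diff two {resource_name: config} maps into create/destroy/update sets."""
    return {
        "create": sorted(desired.keys() - current.keys()),
        "destroy": sorted(current.keys() - desired.keys()),
        "update": sorted(k for k in current.keys() & desired.keys()
                         if current[k] != desired[k]),
    }

current = {"web-server": {"size": "small"}, "old-queue": {}}
desired = {"web-server": {"size": "large"}, "database": {"engine": "postgres"}}
print(plan(current, desired))
# -> {'create': ['database'], 'destroy': ['old-queue'], 'update': ['web-server']}
```

Because the desired state lives in a text file, the same "plan" can be reviewed in a pull request before anything is touched – that review step is what ClickOps can never give you.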
There Are No Dumb Questions
"Do I need to learn Terraform?"
If you're in a technical role, yes – Terraform is the most widely adopted IaC tool and a near-universal requirement for cloud engineering roles. If you're in a non-technical role, you don't need to write Terraform, but understanding what IaC is and why your engineering team uses it will help you communicate about timelines, infrastructure costs, and change management.
"Can't I just click around in the AWS console?"
You can โ and many people do for quick experiments. This is called "ClickOps." But for anything production-grade, ClickOps is risky: there's no audit trail, no easy way to replicate the setup, and one wrong click can take down your system. IaC eliminates these risks.
Putting it all together: the restaurant chain
Let's design a cloud system the way you'd design a restaurant chain.
You want to open a taco chain that serves customers across the US, Europe, and Asia.
- Regions – You open locations in three cities: New York, London, and Tokyo. Each location operates independently. If the London location burns down, New York and Tokyo keep serving tacos.
- Availability zones – Within each city, you have 3 buildings: a main restaurant, a backup kitchen, and a prep facility. If the main restaurant floods, the backup kitchen takes over.
- Auto-scaling – Each location has a core staff of 5, but during lunch rush, you bring in 10 extra workers. After the rush, they go home. You don't pay 15 people to stand around at 3 p.m.
- Load balancing – A host at the front door seats customers at the least-busy table. If one section of the restaurant is closed, the host redirects everyone to open sections.
- Microservices – The kitchen is divided into stations: grill, prep, desserts, drinks. Each station operates independently. If the dessert station breaks, mains keep flowing.
- Containers – Every recipe is standardised in a laminated card. A new cook can pick up any recipe card and produce the exact same dish, every time, at any location.
- Infrastructure as code – You have a complete operations manual. Opening a new location means following the playbook, not reinventing everything from scratch.
- Serverless – For catering orders (unpredictable, occasional), you use a ghost kitchen. No permanent staff, no fixed costs – you only pay when an order comes in.
This is how real cloud architectures work. Replace "restaurant" with "application," "cook" with "server," and "customers" with "requests," and you have a production-grade distributed system.
Back to Maria's taco restaurant
Maria's capacity problem – a kitchen that handled Tuesday lunch but collapsed on Saturday night – is the same problem every cloud architect solves. Auto-scaling is hiring extra cooks when the line wraps around the block and sending them home when it is quiet. Load balancers are the host seating customers at the right table. Multi-AZ deployment is opening a second kitchen across town so one fire does not shut down the whole operation. Every concept in this module maps to a real-world problem that restaurant owners have been solving for decades – the cloud just makes it possible at internet scale.
Key takeaways
- The Well-Architected Framework has five pillars: operational excellence, security, reliability, performance efficiency, and cost optimisation. Every architectural decision should be evaluated against all five.
- High availability minimises downtime. Fault tolerance eliminates it. Disaster recovery plans for getting back up after a major failure. Most systems need all three at different levels.
- Regions are isolated geographic areas (cities). Availability zones are independent data centres within a region (buildings). Deploy across multiple AZs for high availability, across regions for disaster recovery.
- Auto-scaling automatically adjusts capacity based on demand – like a thermostat for your infrastructure. Prefer horizontal scaling (more servers) over vertical scaling (bigger server).
- Load balancers distribute traffic across servers, hiding failures and enabling scaling. They're the restaurant host seating customers at the right table.
- Microservices split an application into independent services. Start with a monolith unless you have a strong reason not to.
- Serverless (Lambda, Azure Functions) lets you run code without managing servers – ideal for event-driven, unpredictable workloads.
- Containers (Docker) package apps with their dependencies. Kubernetes orchestrates containers at scale.
- Infrastructure as Code (Terraform, CloudFormation) defines infrastructure in version-controlled files – reproducible, auditable, and automatable.
Knowledge Check
1. A company needs its payment processing system to continue working with zero downtime, even if an entire server fails mid-transaction. Which concept best describes this requirement?
2. An e-commerce platform experiences 10x traffic during Black Friday compared to a normal day. Which architectural approach best handles this pattern?
3. A startup wants to process user-uploaded images (resize, compress, generate thumbnails). Uploads are unpredictable – 5 per hour on quiet days, 5,000 per hour after a marketing campaign. Which compute model is the best fit?
4. A team manages their cloud infrastructure by clicking through the AWS web console. A new engineer accidentally deletes a production database. Which practice would have most likely prevented this?