AWS DevOps Interview Questions 2026: The Complete Guide
AWS interviews in 2026 have evolved. Companies aren't just testing tool knowledge; they're testing how you think under production pressure. Here are the questions that matter most, with the frameworks that score 8+/10 consistently.
Section 1: Core AWS Services
1. What is the difference between a Security Group and a NACL?
Why they ask this: This is a filtering question. They want to see if you understand stateful vs stateless, and whether you know when to use each.
Ideal answer framework:
Security Groups are stateful: if you allow inbound traffic on port 80, the return traffic is automatically allowed. They operate at the instance level.
NACLs (Network Access Control Lists) are stateless: you must explicitly allow both inbound AND outbound traffic. They operate at the subnet level and process rules in number order (lowest first).
When to use each:
- Security Groups: instance-level control, allow rules by protocol, port, and source
- NACLs: subnet-level control, blocking specific IPs, compliance requirements
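Follow-up they often ask: "If you wanted to block a specific IP from reaching your entire VPC, which would you use?" Answer: a NACL. Security Groups only support allow rules, so they cannot express an explicit deny, while a NACL deny rule is evaluated at the subnet boundary before traffic reaches any instance. To cover the entire VPC, the deny rule has to be on the NACL of every subnet.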
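If it helps to make the rule ordering concrete, here is a minimal Python sketch of how a NACL reaches a verdict: rules are checked lowest number first and the first match wins. The `NaclRule` class, the `nacl_allows` function, and the documentation-range IP addresses are illustrative assumptions, not an AWS API.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class NaclRule:
    number: int  # lower numbers are evaluated first
    cidr: str    # source CIDR the rule applies to
    port: int    # single destination port, to keep the sketch small
    allow: bool  # True = ALLOW, False = DENY


def nacl_allows(rules: list[NaclRule], source_ip: str, port: int) -> bool:
    """Return the verdict of the first matching rule, in rule-number order."""
    for rule in sorted(rules, key=lambda r: r.number):
        if ip_address(source_ip) in ip_network(rule.cidr) and rule.port == port:
            return rule.allow
    return False  # models the implicit final deny rule


# Hypothetical inbound rules: the deny for one IP has a lower number
# than the broad allow, so it is evaluated first.
inbound = [
    NaclRule(number=100, cidr="0.0.0.0/0", port=443, allow=True),
    NaclRule(number=90, cidr="203.0.113.7/32", port=443, allow=False),
]

print(nacl_allows(inbound, "203.0.113.7", 443))    # False
print(nacl_allows(inbound, "198.51.100.20", 443))  # True
```

If the deny rule were numbered 110 instead of 90, the broad allow at 100 would match first and the block would silently do nothing. That ordering mistake is worth mentioning in the interview.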
2. Explain VPC peering vs Transit Gateway
Why they ask this: Architects need to understand network topology at scale. This tests whether you've worked on multi-account or multi-region setups.
Ideal answer:
VPC Peering is a direct 1:1 connection between two VPCs. It's non-transitive: if VPC A peers with VPC B, and VPC B peers with VPC C, VPC A cannot talk to VPC C through B. Works across accounts and regions.
Transit Gateway is a hub-and-spoke model. All VPCs connect to the Transit Gateway, which routes traffic between them. Supports thousands of VPCs, on-premises connections, and is transitive by design.
Decision framework:
- 2-5 VPCs with simple connectivity → VPC Peering (cheaper, simpler)
- 5+ VPCs, multi-account, on-premises hybrid → Transit Gateway
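The non-transitive point and the scaling argument can both be shown with a few lines of Python. This is a toy model under simplifying assumptions (VPCs are just labels, and route tables are assumed to permit the traffic); the function names are made up for illustration.

```python
def peering_reachable(peerings: set[frozenset[str]], src: str, dst: str) -> bool:
    """VPC peering is non-transitive: only a direct peering link counts."""
    return frozenset({src, dst}) in peerings


def tgw_reachable(attachments: set[str], src: str, dst: str) -> bool:
    """With a hub, any two attached VPCs can reach each other."""
    return src in attachments and dst in attachments


def full_mesh_links(vpc_count: int) -> int:
    """Peering connections needed so every VPC can reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


peerings = {frozenset({"A", "B"}), frozenset({"B", "C"})}
attachments = {"A", "B", "C"}

print(peering_reachable(peerings, "A", "C"))  # False: no hop through B
print(tgw_reachable(attachments, "A", "C"))   # True
print(full_mesh_links(5), full_mesh_links(10))  # 10 45
```

The last line is the reason behind the decision framework: a full mesh grows quadratically, while a hub needs one attachment per VPC.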
3. Your EC2 instance is showing high CPU but your application load is normal. What do you investigate?
Why they ask this: This is a debugging scenario. They want to see your systematic troubleshooting approach.
Step-by-step investigation:
1. Check the processes: SSH in, run top or htop. Identify which process is consuming CPU.
2. Check for noisy neighbors: On a shared instance, other tenants can cause steal time. Check %st in top.
3. Check CloudWatch metrics: CPU credit balance (for T-series), network I/O, disk I/O. Sometimes an apparent CPU spike is really I/O wait.
4. Check for crypto mining: An unusual process with a generic name consuming high CPU is a red flag.
5. Check auto scaling triggers: Was a scaling event supposed to happen but didn't?
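Common gotcha: T2/T3 instances are burstable and run on CPU credits. In standard mode, an instance that runs out of credits is throttled to its baseline, so even a normal load can pin the CPU. In unlimited mode (the default for T3) it keeps bursting but accrues surplus-credit charges. Check the CPUCreditBalance metric.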
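To make the triage order concrete, here is a small Python sketch that reads the CPU summary line from `top` and decides which of the steps above to chase first. The sample line, the 20% threshold, and the function names are assumptions for illustration; real `top` output varies by version and locale.

```python
import re


def parse_top_cpu_line(line: str) -> dict[str, float]:
    """Extract fields such as us, sy, id, wa, st from top's CPU summary line."""
    pairs = re.findall(r"(\d+(?:\.\d+)?)\s+([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}


def diagnose(cpu: dict[str, float], threshold: float = 20.0) -> str:
    """Pick the first suspect, checking steal and I/O wait before processes."""
    if cpu.get("st", 0.0) >= threshold:
        return "steal"    # hypervisor is taking CPU time (noisy neighbor or throttling)
    if cpu.get("wa", 0.0) >= threshold:
        return "iowait"   # CPU is waiting on disk or network I/O
    if cpu.get("us", 0.0) + cpu.get("sy", 0.0) >= threshold:
        return "process"  # a local process is burning CPU
    return "idle"


# Hypothetical sample line, not captured from a real host.
sample = "%Cpu(s): 12.0 us,  3.0 sy,  0.0 ni, 20.0 id,  5.0 wa,  0.0 hi,  0.0 si, 60.0 st"
cpu = parse_top_cpu_line(sample)
print(cpu["st"], diagnose(cpu))  # 60.0 steal
```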
4. What is S3 Transfer Acceleration and when would you use it?
Ideal answer: Transfer Acceleration uses CloudFront's edge network to speed up uploads to S3. Instead of uploading directly to your S3 bucket (which may be in us-east-1), you upload to the nearest CloudFront edge, which then transfers via AWS's optimized backbone network.
Use when: You have users globally uploading large files (videos, backups) to a centralized S3 bucket. Typical improvement: 50-500% for large files over long distances.
Don't use when: Users are in the same region as the S3 bucket. The overhead of routing through an edge location negates any benefit.
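In practice, using Transfer Acceleration mostly means pointing uploads at the bucket's accelerate endpoint instead of the regional one. The sketch below only builds that hostname; the helper name and the bucket name are hypothetical, and acceleration must already be enabled on the bucket for the endpoint to work.

```python
def accelerate_endpoint(bucket: str) -> str:
    """Build the Transfer Acceleration endpoint URL for a bucket."""
    if "." in bucket:
        # Accelerated buckets need DNS-compliant names without periods.
        raise ValueError("Transfer Acceleration does not support bucket names with periods")
    return f"https://{bucket}.s3-accelerate.amazonaws.com"


print(accelerate_endpoint("my-global-uploads"))
# https://my-global-uploads.s3-accelerate.amazonaws.com
```

With boto3 you would normally not build the URL by hand; as far as I know, passing `Config(s3={"use_accelerate_endpoint": True})` when creating the S3 client has the same effect.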
Section 2: CI/CD & Automation
5. Walk me through your CI/CD pipeline for a containerized application
Why they ask this: Every senior DevOps role requires pipeline design knowledge. They want to see end-to-end thinking.
Strong answer structure:
Developer pushes code →
GitHub Actions/Jenkins triggers →
1. Unit tests + linting
2. Docker image build
3. Image scan (Trivy/Snyk)
4. Push to ECR
5. Update Helm chart / Kubernetes manifest
6. Deploy to staging (ArgoCD/Flux)
7. Integration tests
8. Manual approval gate
9. Deploy to production
10. Smoke tests
11. Rollback trigger if health checks fail
What makes this answer strong: mentioning image scanning, approval gates, and automatic rollback. Most candidates skip these.
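The control flow behind that pipeline is simple enough to sketch in Python: stages run in order, the first failure stops everything, and a failure after the production deploy triggers a rollback. This is a toy model, not a real CI system; the stage names, the lambdas standing in for real jobs, and `run_pipeline` itself are assumptions.

```python
from typing import Callable


def run_pipeline(
    stages: list[tuple[str, Callable[[], bool]]],
    rollback: Callable[[], None],
    deploy_stage: str = "deploy-production",
) -> list[str]:
    """Run stages in order, stop at the first failure, roll back if already deployed."""
    log: list[str] = []
    deployed = False
    for name, step in stages:
        if not step():
            log.append(f"failed: {name}")
            if deployed:
                rollback()
                log.append("rolled back")
            break
        log.append(f"ok: {name}")
        deployed = deployed or name == deploy_stage
    return log


stages = [
    ("unit-tests", lambda: True),
    ("image-scan", lambda: True),
    ("deploy-staging", lambda: True),
    ("deploy-production", lambda: True),
    ("smoke-tests", lambda: False),  # simulate a failed health check
]
rollbacks: list[str] = []
result = run_pipeline(stages, rollback=lambda: rollbacks.append("previous-release"))
print(result[-2:])  # ['failed: smoke-tests', 'rolled back']
```

Note that a failed image scan stops the run before anything is pushed or deployed, so no rollback is needed in that case.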
6. How do you handle secrets in a CI/CD pipeline?
Why they ask this: Security is non-negotiable. Wrong answers here are disqualifying.
What NOT to say: Hardcode in environment variables, store in code, put in .env files committed to git.
Correct approaches (mention at least two):
- AWS Secrets Manager with IAM role-based access: the application retrieves secrets at runtime
- Parameter Store (SSM) for non-sensitive config, SecureString for secrets
- HashiCorp Vault for multi-cloud or on-premises
- Kubernetes Secrets (with encryption at rest) + external-secrets-operator to sync from AWS
Rotation: Mention that Secrets Manager supports automatic rotation with Lambda functions.
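Here is a rough Python sketch of the "retrieve at runtime" pattern. It assumes the secret is stored as a JSON string and takes the client as a parameter, so the example runs without AWS access; in real code the client would come from `boto3.client("secretsmanager")`. The secret name `prod/db` and the stub class are hypothetical.

```python
import json
from typing import Any


def load_db_credentials(client: Any, secret_id: str) -> dict[str, str]:
    """Fetch a JSON secret at runtime instead of baking it into the image or repo."""
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


class FakeSecretsClient:
    """Stand-in for the Secrets Manager client so the sketch runs offline."""

    def get_secret_value(self, SecretId: str) -> dict[str, str]:
        return {"SecretString": json.dumps({"username": "app", "password": "example-only"})}


creds = load_db_credentials(FakeSecretsClient(), "prod/db")
print(creds["username"])  # app
```

Because the lookup happens on every start (or on a refresh timer), a rotated secret is picked up without a redeploy. That is the point to make when rotation comes up.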
Section 3: Kubernetes on AWS (EKS)
7. What is the difference between a Deployment and a StatefulSet?
Ideal answer:
Deployment manages stateless pods. Pods are interchangeable: they can be killed and replaced in any order. Each pod gets a random name suffix (e.g., nginx-7d8f9c-xkz2p).
StatefulSet manages stateful pods. Each pod has a stable, predictable identity (mysql-0, mysql-1, mysql-2). Pods are created and deleted in order. Each pod can have its own persistent volume that follows it through reschedules.
Use Deployments for: Web servers, API services, workers, anything that doesn't need to remember who it is.
Use StatefulSets for: Databases (MySQL, PostgreSQL), message queues (Kafka), Elasticsearch, anything that needs stable identity or persistent storage.
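The "stable identity" idea is easier to explain with the naming scheme in front of you. The Python sketch below reproduces how StatefulSet pod and volume claim names are derived from ordinals, assuming the default ordered pod management policy; the helper names and the `data` claim template name are made up for the example.

```python
def pod_names(statefulset: str, replicas: int) -> list[str]:
    """Pods are named <statefulset>-<ordinal> and created in this order."""
    return [f"{statefulset}-{ordinal}" for ordinal in range(replicas)]


def pvc_names(claim_template: str, statefulset: str, replicas: int) -> list[str]:
    """Each pod gets its own claim: <claim template>-<pod name>."""
    return [f"{claim_template}-{pod}" for pod in pod_names(statefulset, replicas)]


def scale_down_order(statefulset: str, replicas: int) -> list[str]:
    """Pods are removed highest ordinal first."""
    return list(reversed(pod_names(statefulset, replicas)))


print(pod_names("mysql", 3))          # ['mysql-0', 'mysql-1', 'mysql-2']
print(pvc_names("data", "mysql", 3))  # ['data-mysql-0', 'data-mysql-1', 'data-mysql-2']
print(scale_down_order("mysql", 3))   # ['mysql-2', 'mysql-1', 'mysql-0']
```

Because the claim name is tied to the pod name, a rescheduled `mysql-1` reattaches to `data-mysql-1` rather than getting a fresh, empty volume.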
8. Your pods are in CrashLoopBackOff. Walk me through your debugging process.
This is the #1 Kubernetes scenario question. Here is the exact mental model:
Step 1: Describe the pod:
kubectl describe pod <pod-name> -n <namespace>
Look at: Events section (bottom), Last State, Reason for termination.
Step 2: Check logs:
kubectl logs <pod-name> --previous # logs from the crashed container
kubectl logs <pod-name> # logs from current container
Step 3: Common causes and fixes:
| Root Cause | Signal | Fix |
|---|---|---|
| App crash | Exit code 1, stack trace in logs | Fix application bug |
| OOMKilled | Exit code 137 | Increase memory limits |
| Config error | "Cannot find config file" | Check ConfigMap/Secret mounts |
| Liveness probe failing | Probe failures in events | Fix probe path or increase threshold |
| Init container failing | Init container status | Debug init container separately |
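The exit codes in the table follow a common convention: a code above 128 usually means the container was killed by signal number (code - 128), which is why OOMKilled shows up as 137 (128 + 9, SIGKILL). A small Python sketch of that decoding, with a deliberately short signal table and hypothetical wording for the hints:

```python
# Only a few common signals are listed; this is not an exhaustive table.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code: int) -> str:
    """Turn a container exit code into a first debugging hint."""
    if code == 0:
        return "exited cleanly: check why the main process is not long-running"
    if code > 128:
        signal_number = code - 128
        name = SIGNAL_NAMES.get(signal_number, f"signal {signal_number}")
        return f"killed by {name}"
    return "application error: read the logs from the previous container"


for code in (1, 137, 143):
    print(code, explain_exit_code(code))
# 1 application error: read the logs from the previous container
# 137 killed by SIGKILL
# 143 killed by SIGTERM
```

A 137 is only a strong hint of an out-of-memory kill; confirm it with the `OOMKilled` reason in the `kubectl describe pod` output before raising limits.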
Section 4: Cost & Architecture
9. Your S3 costs tripled last month. How do you investigate?
Step-by-step framework:
1. AWS Cost Explorer: Filter by S3, break down by usage type (GET requests, PUT requests, storage, data transfer)
2. S3 Storage Lens: Check which buckets grew, which prefixes
3. Access logs: Look for unexpected GET patterns. These could come from a crawler, a misconfigured application, or public exposure
4. Check lifecycle policies: Was a rule that moved data to Glacier deleted, disabled, or never applied to a new prefix?
5. Data transfer costs: S3 → internet is expensive. Check if data is crossing regions unnecessarily.
Common causes: Forgot to enable lifecycle policy, application logging to S3 with every request, public bucket being scraped.
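Step 1 of the framework boils down to a month-over-month diff by usage type. The Python sketch below shows that comparison on invented numbers (all dollar figures and category names are hypothetical, chosen so the total triples); with real data you would export the same breakdown from Cost Explorer.

```python
def biggest_increases(
    previous: dict[str, float], current: dict[str, float], top: int = 3
) -> list[tuple[str, float]]:
    """Rank usage types by how much their cost grew between two months."""
    usage_types = set(previous) | set(current)
    deltas = {u: current.get(u, 0.0) - previous.get(u, 0.0) for u in usage_types}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)[:top]


# Hypothetical monthly S3 costs in USD.
last_month = {"storage": 400.0, "get_requests": 50.0, "put_requests": 30.0, "data_transfer_out": 120.0}
this_month = {"storage": 420.0, "get_requests": 60.0, "put_requests": 35.0, "data_transfer_out": 1285.0}

print(sum(this_month.values()) / sum(last_month.values()))  # 3.0
print(biggest_increases(last_month, this_month, top=1))     # [('data_transfer_out', 1165.0)]
```

In this made-up case storage barely moved, so lifecycle policies are not the first suspect. The spike in outbound transfer points toward a public bucket being scraped or cross-region traffic.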
10. What is the difference between Spot, On-Demand, and Reserved instances?
| Type | Use Case | Savings | Risk |
|---|---|---|---|
| On-Demand | Unpredictable workloads, production | Baseline | None |
| Reserved (1-3yr) | Predictable baseline load | 40-60% | Locked in |
| Spot | Batch jobs, stateless workers, CI/CD | 70-90% | Can be terminated with 2-min notice |
| Savings Plans | Flexible commitment | 40-60% | Commitment to usage level |
Interview tip: Always mention Spot interruption handling: use interruption notices, design for graceful shutdown, use mixed instance types in Auto Scaling Groups.
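To show what "use interruption notices" looks like in code: a Spot instance can poll the instance metadata path `/latest/meta-data/spot/instance-action`, which returns a small JSON document once an interruption is scheduled. The Python sketch below only parses such a document and computes the time left; the sample payload and timestamps are invented, and the actual HTTP polling is left out so the example runs anywhere.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_json: str, now: datetime) -> float:
    """Return how long the instance has left before the scheduled action."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


# Hypothetical notice in the shape the metadata endpoint returns.
sample_notice = '{"action": "terminate", "time": "2026-01-15T08:22:00Z"}'
now = datetime(2026, 1, 15, 8, 20, 0, tzinfo=timezone.utc)

remaining = seconds_until_interruption(sample_notice, now)
print(remaining)  # 120.0
```

A worker would use that window to stop taking new jobs, checkpoint or requeue in-flight work, and deregister from the load balancer.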
How to Practice These Questions
Reading is not the same as answering under pressure.
The engineers who crack AWS interviews at top companies have one thing in common: they've delivered these answers out loud, to someone who pushes back, multiple times.
InterviewDrill.io lets you practice exactly this. Paste a real AWS job description, select your role, and Joshua, your AI interviewer, asks these exact questions, references your resume, and scores every answer live.
First session is completely free: interviewdrill.io