About the Role
We're looking for a hands-on Senior DevOps Engineer to own our AWS cloud infrastructure and drive engineering velocity through fast, reliable CI/CD pipelines. You'll be the go-to person for reliability, scalability, security, and developer experience across our high-traffic e-commerce platform, which handles millions of transactions each month.
You'll work directly with engineering and product teams to ensure our infrastructure scales predictably during peak traffic (flash sales, holidays) and maintains 99.9%+ uptime while keeping costs under control.
Overview
- Department: Engineering / Infrastructure
- Reports to: CTO / VP Engineering
- Location: Remote (EU/US timezone preferred)
- Experience: 4–8 years in DevOps / Platform Engineering
- Employment: Full-Time
Core Stack
AWS · Terraform · GitHub Actions · Docker · Kubernetes (EKS) · PostgreSQL/Aurora · Redis/ElastiCache · CloudFront · Datadog · ArgoCD
Key Responsibilities
AWS Infrastructure
- Architect, provision, and maintain production AWS infrastructure across multi-AZ, multi-region setups using Terraform
- Manage and optimize ECS/EKS, RDS Aurora, ElastiCache, S3, CloudFront, ALB, Route 53, WAF, SQS/SNS
- Design and enforce VPC architecture, security groups, and IAM roles that follow least-privilege principles
- Implement auto-scaling strategies for flash-sale traffic spikes (EC2 ASGs, ECS task scaling, Lambda concurrency)
- Own infrastructure costs: right-sizing, Reserved Instance planning, and Savings Plans analysis
- Manage secrets and configuration via AWS Secrets Manager and Parameter Store
CI/CD & Developer Experience
- Design and maintain end-to-end CI/CD pipelines (GitHub Actions / GitLab CI) for microservices, frontend, and mobile backends
- Implement blue/green and canary deployments with automated rollback triggers
- Build reusable pipeline templates and shared actions to accelerate developer velocity
- Manage multi-environment promotion workflows (dev → staging → production) with environment-specific gating
- Maintain Docker build pipelines with layer caching, vulnerability scanning (Trivy/Snyk), and ECR lifecycle policies
- Implement feature flag infrastructure for zero-downtime releases
Kubernetes & Container Orchestration
- Administer EKS clusters: node groups, cluster upgrades, add-on lifecycle (CoreDNS, CNI, cluster-autoscaler)
- Design Helm chart templates and ArgoCD manifests for all services
- Implement HPA and VPA for resource-based autoscaling, and KEDA for event-driven scaling
- Manage ingress controllers (AWS ALB / NGINX) and certificate management (cert-manager, ACM)
- Enforce resource quotas, network policies, and pod security standards
Observability & Reliability
- Own the full observability stack: metrics (Datadog / Prometheus + Grafana), logs (OpenSearch / CloudWatch), traces (X-Ray / Jaeger)
- Build dashboards and SLO/SLA alerting for critical e-commerce flows (checkout, payment, inventory)
- Lead incident response and runbook creation, and foster a post-mortem culture
- Implement chaos engineering to proactively identify reliability gaps
- Define and track error budgets, MTTR, and deployment frequency as core KPIs
Security & Compliance
- Implement and maintain Amazon GuardDuty, AWS Security Hub, AWS Config rules, and Amazon Inspector
- Manage SSL/TLS lifecycle, WAF rule sets, and DDoS mitigation via AWS Shield
- Conduct regular security reviews and remediate findings from automated scanning
- Ensure PCI-DSS readiness for payment infrastructure; support SOC 2 audit evidence collection
- Implement secrets rotation, KMS encryption at rest, and encryption in transit
Database & Data Infrastructure
- Manage RDS Aurora PostgreSQL: parameter tuning, read replicas, failover testing
- Administer ElastiCache Redis clusters for session storage and catalog caching
- Own backup strategies, PITR configuration, and DR runbooks
- Collaborate with engineering on connection pooling (PgBouncer/RDS Proxy) and schema migrations
Disaster Recovery
- Define and maintain RTO/RPO targets; conduct quarterly DR drills
- Implement cross-region replication for S3, RDS, and ElastiCache
- Maintain IaC for rapid environment recreation; document runbooks for all critical failure scenarios
Requirements
Must-have:
- 4+ years of hands-on DevOps/SRE/Platform Engineering in production e-commerce or high-traffic SaaS
- Deep AWS expertise — Solutions Architect or DevOps Engineer certification (Professional level preferred)
- Strong Terraform proficiency: module design, state management, Atlantis or Terraform Cloud
- Production EKS experience: cluster operations, Helm, GitOps with ArgoCD or Flux
- CI/CD pipeline expertise with GitHub Actions, GitLab CI, or equivalent
- Solid scripting in Bash and Python
- Docker multi-stage builds, image optimization, and container security scanning
- Strong networking fundamentals: VPC design, subnetting, routing, DNS, load balancing, CDN
- Observability experience: Datadog, Prometheus/Grafana, or equivalent
- Production PostgreSQL and Redis management