Principal DevOps Engineer (AWS)
Who we are:
Adaptiq is a technology hub specializing in building, scaling, and supporting R&D teams for high-end, fast-growing product companies in a wide range of industries.
About the Product:
The product is an AI-native cloud platform that powers intelligent, real-time experiences for enterprise customers at scale. Built on a modern microservices architecture, the platform uses advanced AI capabilities to process large volumes of data securely and reliably across multiple regions. The engineering team focuses on cloud-native infrastructure, Kubernetes, infrastructure as code, GitOps-driven deployments, observability, and efficient orchestration of AI workloads in production. The platform is designed to meet high standards for security, scalability, availability, and regulatory compliance while supporting rapid product growth.
About the Role:
This staff-level DevOps Engineer role owns the end-to-end platform infrastructure during a major platform expansion and rapid growth phase. You will lead the design, implementation, and operation of production-grade AWS environments, applying proven engineering best practices to build a secure, scalable, and highly available platform. Working closely with cross-functional engineering teams, you will help ensure the platform meets demanding reliability, security, and compliance requirements while enabling the delivery of large-scale AI-powered services. This position offers significant ownership and the opportunity to shape infrastructure supporting mission-critical production workloads.
Key Responsibilities:
- Design, build, and maintain scalable AWS cloud infrastructure supporting containerized microservices.
- Own production Kubernetes environments, including auto-scaling, cost optimization, and runtime security.
- Centralize and clean up CI/CD and GitOps pipelines (ArgoCD or similar), enforce RBAC, and ensure safe, rapid deployments.
- Implement infrastructure as code with Terraform and Terragrunt, enforce PR-based apply flows, and manage state across modules.
- Establish isolated, multi-account setups for production, staging, and operations workloads.
- Deploy and operate multi-region infrastructure to meet per-country data residency and compliance requirements.
- Enhance reliability and observability using consolidated tooling, OpenTelemetry, and incident management best practices.
- Strengthen security posture with secret management, least-privilege access, SSO-as-code, and tenant isolation.
- Support compliance initiatives by implementing practical security controls and maintaining audit readiness.
- Build and optimize infrastructure for LLM-heavy inference workloads, including cost controls and provider failover.
- Automate repetitive operations, author runbooks, and mentor team members on DevOps practices.
Required Competence and Skills:
- 8+ years of experience in DevOps, SRE, or platform engineering at a staff or equivalent level.
- Proven track record as the lead or sole owner of production infrastructure in a fast-paced environment.
- Extensive hands-on AWS experience in production, including multi-account and multi-region architectures.
- Deep expertise in running and securing Kubernetes on EKS under real-world load conditions.
- Strong proficiency with Infrastructure as Code using Terraform and Terragrunt (CloudFormation experience is a plus).
- Experience with CI/CD and GitOps tools (ArgoCD, Flux, or similar) and with managing RBAC policies.
- Skilled in monitoring and observability platforms such as Datadog, Prometheus/Grafana, or BetterStack.
- Proficient in scripting or programming languages (Python, Bash, Go, TypeScript, or similar) for automation tasks.
- Ability to diagnose and resolve complex issues across infrastructure and application layers.
- Experience integrating and leveraging AI-assisted tools to accelerate development, debugging, and documentation.
- High ownership mindset, strong initiative, and the judgment to act with urgency without compromising quality.
Nice to Have:
- Hands-on experience supporting security compliance programs and external audits.
- Background in handling regulated or sensitive data (children’s data, PII, health, or financial).
- Experience with multi-tenant SaaS architectures and strong tenant isolation strategies.
- Proven experience building infrastructure for AI or LLM-based products at scale, including GPU clusters.
- Experience operating PostgreSQL in production at scale.
- Familiarity with IaC delivery tools such as Atlantis.