Site Reliability Engineer / DevOps Engineer
We are looking for a Site Reliability Engineer / DevOps Engineer to help scale and operate the core infrastructure behind a rapidly growing AI infrastructure platform. You will join an expanding infrastructure team focused on building reliable, scalable, and cost-efficient systems that power production workloads across a distributed environment.
This role is ideal for someone who enjoys working close to infrastructure, solving complex operational challenges, and improving system reliability in a fast-moving startup environment.
Requirements
- 5+ years of experience in SRE, DevOps, platform engineering, or infrastructure engineering
- Strong production experience with Kubernetes and networking
- Hands-on experience operating AWS infrastructure, especially EKS
- Strong Linux and distributed systems experience, including environments beyond fully managed cloud abstractions
- Experience with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry
- Experience with GitOps and deployment workflows (Helm, FluxCD, or similar)
- Infrastructure as Code experience, ideally Terraform
- Experience with alert tuning, runbooks, and incident management practices
- Clear communicator able to work effectively in async, distributed teams
- Excellent English (spoken and written)
Nice to Have
- Experience with AI inference, ML infrastructure, or high-performance distributed systems
- Experience managing GPU fleets, bare-metal infrastructure, or multi-provider compute environments
- Experience using AI tools to improve engineering productivity
Responsibilities
- Build, operate, and continuously improve the infrastructure powering a distributed inference platform
- Own reliability, scalability, and operational excellence across AWS-based control planes and a multi-provider GPU fleet
- Design and maintain networking between control planes, Kubernetes clusters, and geographically distributed GPU hosts
- Operate and enhance Kubernetes-based orchestration (primarily EKS)
- Manage deployments and infrastructure using Helm, FluxCD, and Terraform
- Improve observability through metrics, logs, traces, dashboards, and alerting (Prometheus, Grafana, Loki, Jaeger, OpenTelemetry)
- Tune alerts, develop runbooks, and strengthen operational readiness as systems scale
- Respond to production incidents, perform root cause analysis, and implement long-term fixes
- Collaborate with globally distributed engineers using clear asynchronous communication and structured handoffs
About the Project
The product is a US-based platform providing scalable and reliable access to GPU compute through a globally distributed infrastructure. It enables customers to run production AI inference workloads efficiently across multiple providers.
Required skills and experience
| Skill | Minimum experience |
| --- | --- |
| Kubernetes | 3 years |
| AWS EKS | 2 years |
| Networking | 3 years |
Required languages
| Language | Level |
| --- | --- |
| English | B2 - Upper Intermediate |
| Ukrainian | Native |