Site Reliability Engineer / DevOps Engineer

Eastern Peak Verified Employer

We are looking for a Site Reliability Engineer / DevOps Engineer to help scale and operate the core infrastructure behind a rapidly growing AI infrastructure platform. You will join an expanding infrastructure team focused on building reliable, scalable, and cost-efficient systems that power production workloads across a distributed environment.

This role is ideal for someone who enjoys working close to infrastructure, solving complex operational challenges, and improving system reliability in a fast-moving startup environment.

Requirements
- 5+ years of experience in SRE, DevOps, platform engineering, or infrastructure engineering
- Strong production experience with Kubernetes and networking
- Hands-on experience operating AWS infrastructure, especially EKS
- Strong Linux and distributed systems experience, including environments beyond fully managed cloud abstractions
- Experience with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry
- Experience with GitOps and deployment workflows (Helm, FluxCD or similar)
- Infrastructure as Code experience, ideally Terraform
- Experience with alert tuning, runbooks, and incident management practices
- Clear communicator able to work effectively in async, distributed teams
- Excellent English (spoken and written)

Nice to Have
- Experience with AI inference, ML infrastructure, or high-performance distributed systems
- Experience managing GPU fleets, bare-metal infrastructure, or multi-provider compute environments
- Experience using AI tools to improve engineering productivity

Responsibilities
- Build, operate, and continuously improve the infrastructure powering a distributed inference platform
- Own reliability, scalability, and operational excellence across AWS-based control planes and a multi-provider GPU fleet
- Design and maintain networking between control planes, Kubernetes clusters, and geographically distributed GPU hosts
- Operate and enhance Kubernetes-based orchestration (primarily EKS)
- Manage deployments and infrastructure using Helm, FluxCD, and Terraform
- Improve observability through metrics, logs, traces, dashboards, and alerting (Prometheus, Grafana, Loki, Jaeger, OpenTelemetry)
- Tune alerts, develop runbooks, and strengthen operational readiness as systems scale
- Respond to production incidents, perform root cause analysis, and implement long-term fixes
- Collaborate with globally distributed engineers using clear asynchronous communication and structured handoffs

About the Project
The product is a US-based platform providing scalable and reliable access to GPU compute through a globally distributed infrastructure. It enables customers to run production AI inference workloads efficiently across multiple providers.

Required skills and experience
- Kubernetes: 3 years
- AWS EKS: 2 years
- Networking: 3 years

Required languages
- English: B2 (Upper Intermediate)
- Ukrainian: Native

Published 26 March