Site Reliability Engineer / DevOps Engineer
We are looking for a Site Reliability Engineer / DevOps Engineer to help scale and operate the core infrastructure behind a rapidly growing AI infrastructure platform. You will join an expanding infrastructure team focused on building reliable, scalable, and cost-efficient systems that power production workloads across a distributed environment.
This role is ideal for someone who enjoys working close to infrastructure, solving complex operational challenges, and improving system reliability in a fast-moving startup environment.
Requirements
- 5+ years of experience in SRE, DevOps, platform engineering, or infrastructure engineering
- Strong production experience with Kubernetes and networking
- Hands-on experience operating AWS infrastructure, especially EKS
- Strong Linux and distributed systems experience, including environments beyond fully managed cloud abstractions
- Experience with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry
- Experience with GitOps and deployment workflows (Helm, FluxCD, or similar)
- Infrastructure as Code experience, ideally Terraform
- Experience with alert tuning, runbooks, and incident management practices
- Clear communicator able to work effectively in async, distributed teams
- Excellent English (spoken and written)
Nice to Have
- Experience with AI inference, ML infrastructure, or high-performance distributed systems
- Experience managing GPU fleets, bare-metal infrastructure, or multi-provider compute environments
- Experience using AI tools to improve engineering productivity
Responsibilities
- Build, operate, and continuously improve the infrastructure powering a distributed inference platform
- Own reliability, scalability, and operational excellence across AWS-based control planes and a multi-provider GPU fleet
- Design and maintain networking between control planes, Kubernetes clusters, and geographically distributed GPU hosts
- Operate and enhance Kubernetes-based orchestration (primarily EKS)
- Manage deployments and infrastructure using Helm, FluxCD, and Terraform
- Improve observability through metrics, logs, traces, dashboards, and alerting (Prometheus, Grafana, Loki, Jaeger, OpenTelemetry)
- Tune alerts, develop runbooks, and strengthen operational readiness as systems scale
- Respond to production incidents, perform root cause analysis, and implement long-term fixes
- Collaborate with globally distributed engineers using clear asynchronous communication and structured handoffs
About the Project
The product is a US-based platform providing scalable and reliable access to GPU compute through a globally distributed infrastructure. It enables customers to run production AI inference workloads efficiently across multiple providers.
Required skills and experience
| Skill | Minimum experience |
| --- | --- |
| Kubernetes | 3 years |
| AWS EKS | 2 years |
| Networking | 3 years |
Required languages
| Language | Level |
| --- | --- |
| English | B2 - Upper Intermediate |
| Ukrainian | Native |