Site Reliability Engineer (SRE)


Intetics Inc., a leading global technology company providing custom software application development, distributed professional teams, software product quality assessment, and “all-things-digital” solutions, is looking for a Site Reliability Engineer to work on one of our projects.

We are building the infrastructure layer for modern AI inference: a globally distributed platform designed to deliver scalable, cost-efficient, and reliable access to GPU compute.

Our platform includes globally distributed control planes running on AWS, a large distributed fleet of GPU inference nodes across multiple GPU providers, and the networking systems that connect everything into a reliable, high-performance production platform. We use Kubernetes to orchestrate inference workloads and an open-source observability stack built around Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry.

The control plane is relatively simple and primarily runs on EKS, with MySQL on AWS and a small set of supporting services. The more complex part of the platform is the GPU side: hosts are spread across providers and are often operated more like distributed on-prem infrastructure than traditional cloud services.

Role

We’re hiring an SRE / DevOps / Infrastructure Engineer to help scale and operate the core infrastructure behind our growing inference platform. This is a hands-on role for someone who wants to work on hard infrastructure problems across networking, Kubernetes, reliability, observability, and production operations. You’ll join a small, high-impact infrastructure team of two engineers, with plans to grow to 4–5 by the end of the year.

What you’ll do

· Build, operate, and improve the infrastructure powering the distributed inference platform

· Own reliability, scalability, and operational excellence across AWS-based control planes and our multi-provider GPU fleet

· Design and maintain the networking layer connecting control planes, Kubernetes clusters, and geographically distributed GPU hosts

· Operate and improve Kubernetes-based inference orchestration, primarily on EKS

· Manage deployments and infrastructure changes using Helm, FluxCD, and Terraform

· Improve observability across the platform using metrics, logs, traces, dashboards, and alerting built on Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry

· Tune alerts, improve runbooks, and strengthen operational readiness as the system scales

· Respond to production issues, perform root cause analysis, and implement durable fixes

· Work closely with engineers across time zones using clear asynchronous communication and handoff practices, especially through Slack

· Help expand Europe-based infrastructure coverage to support sustainable operations outside US business hours

Requirements

· 5+ years of experience in SRE, DevOps, platform engineering, or infrastructure engineering

· Strong production experience with networking and Kubernetes

· Experience operating AWS infrastructure in production, especially EKS

· Strong hands-on experience managing Linux hosts, clusters, and distributed systems in environments that are not fully abstracted by a major cloud provider

· Experience with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry

· Experience with deployment and GitOps workflows using tools such as Helm and FluxCD

· Experience with infrastructure as code, ideally Terraform

· Familiarity with alert tuning, runbook development, and practical incident management in production systems

· Strong operational judgment: able to troubleshoot independently, respond calmly to incidents, and improve systems without constant direction

· Comfortable working in a fast-moving startup where infrastructure, product, and customer demands change quickly

· Clear communicator who can work effectively in an async environment and handle shift handoffs cleanly

Nice to have

· Experience with AI inference, ML infrastructure, or adjacent high-performance distributed systems

· Experience operating heterogeneous GPU fleets, bare-metal infrastructure, or multi-provider compute environments

· Experience using AI tools productively in engineering workflows

Required languages

English: B2 (Upper Intermediate)
Published 20 March