Site Reliability Engineer (SRE)


Intetics Inc., a leading global technology company providing custom software application development, distributed professional teams, software product quality assessment, and “all-things-digital” solutions, is looking for a Site Reliability Engineer to work on one of our projects.

We are building the infrastructure layer for modern AI inference: a globally distributed platform designed to deliver scalable, cost-efficient, and reliable access to GPU compute.

Our platform includes globally distributed control planes running on AWS, a large distributed fleet of GPU inference nodes across multiple GPU providers, and the networking systems that connect everything into a reliable, high-performance production platform. We use Kubernetes to orchestrate inference workloads and an open-source observability stack built around Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry.

The control plane is relatively simple and primarily runs on EKS, with MySQL on AWS and a small set of supporting services. The more complex part of the platform is the GPU side: hosts are spread across providers and are often operated more like distributed on-prem infrastructure than traditional cloud services.

Role

We’re hiring an SRE / DevOps / Infrastructure Engineer to help scale and operate the core infrastructure behind our growing inference platform. This is a hands-on role for someone who wants to work on hard infrastructure problems across networking, Kubernetes, reliability, observability, and production operations. You’ll join a small, high-impact infrastructure team of two engineers, with plans to grow to 4–5 by the end of the year.

What you’ll do

· Build, operate, and improve the infrastructure powering the distributed inference platform

· Own reliability, scalability, and operational excellence across AWS-based control planes and our multi-provider GPU fleet

· Design and maintain the networking layer connecting control planes, Kubernetes clusters, and geographically distributed GPU hosts

· Operate and improve Kubernetes-based inference orchestration, primarily on EKS

· Manage deployments and infrastructure changes using Helm, FluxCD, and Terraform

· Improve observability across the platform using metrics, logs, traces, dashboards, and alerting built on Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry

· Tune alerts, improve runbooks, and strengthen operational readiness as the system scales

· Respond to production issues, perform root cause analysis, and implement durable fixes

· Work closely with engineers across time zones using clear asynchronous communication and handoff practices, especially through Slack

· Help expand Europe-based infrastructure coverage to support sustainable operations outside US business hours

Requirements

· 5+ years of experience in SRE, DevOps, platform engineering, or infrastructure engineering

· Strong production experience with networking and Kubernetes

· Experience operating AWS infrastructure in production, especially EKS

· Strong hands-on experience managing Linux hosts, clusters, and distributed systems in environments that are not fully abstracted by a major cloud provider

· Experience with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry

· Experience with deployment and GitOps workflows using tools such as Helm and FluxCD

· Experience with infrastructure as code, ideally Terraform

· Familiarity with alert tuning, runbook development, and practical incident management in production systems

· Strong operational judgment: able to troubleshoot independently, respond calmly to incidents, and improve systems without constant direction

· Comfortable working in a fast-moving startup where infrastructure, product, and customer demands change quickly

· Clear communicator who can work effectively in an async environment and handle shift handoffs cleanly

Nice to have

· Experience with AI inference, ML infrastructure, or adjacent high-performance distributed systems

· Experience operating heterogeneous GPU fleets, bare-metal infrastructure, or multi-provider compute environments

· Experience using AI tools productively in engineering workflows

Required languages

English: B2 (Upper Intermediate)
Published 20 March