Site Reliability Engineer (SRE)
Intetics Inc., a leading global technology company providing custom software application development, distributed professional teams, software product quality assessment, and “all-things-digital” solutions, is looking for a Site Reliability Engineer to work on one of our projects.
We are building the infrastructure layer for modern AI inference: a globally distributed platform designed to deliver scalable, cost-efficient, and reliable access to GPU compute.
Our platform includes globally distributed control planes running on AWS, a large distributed fleet of GPU inference nodes across multiple GPU providers, and the networking systems that connect everything into a reliable, high-performance production platform. We use Kubernetes to orchestrate inference workloads and an open-source observability stack built around Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry.
The control plane is relatively simple and primarily runs on EKS, with MySQL on AWS and a small set of supporting services. The more complex part of the platform is the GPU side: hosts are spread across providers and are often operated more like distributed on-prem infrastructure than traditional cloud services.
Role
We’re hiring an SRE / DevOps / Infrastructure Engineer to help scale and operate the core infrastructure behind our growing inference platform. This is a hands-on role for someone who wants to work on hard infrastructure problems across networking, Kubernetes, reliability, observability, and production operations. You’ll join a small, high-impact infrastructure team of two engineers, with plans to grow to 4–5 by the end of the year.
What you’ll do
· Build, operate, and improve the infrastructure powering the distributed inference platform
· Own reliability, scalability, and operational excellence across AWS-based control planes and our multi-provider GPU fleet
· Design and maintain the networking layer connecting control planes, Kubernetes clusters, and geographically distributed GPU hosts
· Operate and improve Kubernetes-based inference orchestration, primarily on EKS
· Manage deployments and infrastructure changes using Helm, FluxCD, and Terraform
· Improve observability across the platform using metrics, logs, traces, dashboards, and alerting built on Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry
· Tune alerts, improve runbooks, and strengthen operational readiness as the system scales
· Respond to production issues, perform root cause analysis, and implement durable fixes
· Work closely with engineers across time zones using clear asynchronous communication and handoff practices, especially through Slack
· Help expand Europe-based infrastructure coverage to support sustainable operations outside US business hours
Requirements
· 5+ years of experience in SRE, DevOps, platform engineering, or infrastructure engineering
· Strong production experience with networking and Kubernetes
· Experience operating AWS infrastructure in production, especially EKS
· Strong hands-on experience managing Linux hosts, clusters, and distributed systems in environments that are not fully abstracted by a major cloud provider
· Experience with Prometheus, Grafana, Loki, Jaeger, and OpenTelemetry
· Experience with deployment and GitOps workflows using tools such as Helm and FluxCD
· Experience with infrastructure as code, ideally Terraform
· Familiarity with alert tuning, runbook development, and practical incident management in production systems
· Strong operational judgment: able to troubleshoot independently, respond calmly to incidents, and improve systems without constant direction
· Comfortable working in a fast-moving startup where infrastructure, product, and customer demands are changing quickly
· Clear communicator who can work effectively in an async environment and handle shift handoffs cleanly
Nice to have
· Experience with AI inference, ML infrastructure, or adjacent high-performance distributed systems
· Experience operating heterogeneous GPU fleets, bare-metal infrastructure, or multi-provider compute environments
· Experience using AI tools productively in engineering workflows
Required languages
· English: B2 (Upper Intermediate)