Lumnix

Senior Site Reliability Engineer (SRE)


Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. They operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and they're rapidly expanding. Their infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.

They're small, senior, and moving fast. The people who do well with our client own problems end-to-end and make decisions with incomplete information.

 

The role

We are hiring a Senior SRE to own the incident and problem management programs for our client. This is the person who runs the call when a customer cluster goes down, drives the post-mortem afterward, and hardens the runbooks so the same failure does not surface twice.

This is not an ITIL paperwork role. You will be technically credible across Kubernetes, Linux, networking, and GPU infrastructure, enough to lead a live incident with platform engineers and a customer on the bridge. But your center of gravity is the process: fast, clean incident response, honest communication, and a relentless feedback loop between customers and engineering.

 

You will set the standard for how our client behaves during their most challenging hours. Done well, this is one of the most visible and highest-leverage roles on the team. You will work closely with cloud operations, network teams, and customer-facing counterparts. Expect to be on-call in a shared rotation.

 

What you'll own

 

  • Incident command. Run major incidents end-to-end: triage, escalation, comms cadence, timeline, decision log, and drive-to-resolution. You are the conductor, not the fixer.
  • Customer communications. Own customer-facing incident messaging: initial acknowledgement, status updates at defined intervals, resolution notice, and written RCA. Clear, honest, no hedging.
  • Post-mortem and RCA program. Run blameless post-mortems, write the RCA, track action items to completion, and publish learnings across the org. Own the quality bar for RCA writing for our client.
  • Escalation framework. Define and maintain the severity matrix, escalation trees, on-call rotations, and paging policies across platform, network, DC ops, and leadership.
  • Runbooks and playbooks. Turn tribal knowledge into durable, testable runbooks. Drive the engineering teams to document before the next 2am page, not after.
  • On-call health. Own the on-call experience: page volume, alert quality, rotation fairness, and handoff hygiene. Kill noisy alerts. Fix the ones that matter.
  • Reliability metrics. Define and report the metrics that matter: MTTA, MTTR, incident count by class, SLO attainment, repeat-offender systems. Make reliability a number the business trusts.
  • Game days and drills. Plan and run incident drills across clusters, regions, and customer scenarios. Surface gaps before customers do.
  • Incident tooling. Own the incident tooling stack: paging, status page, incident channel automation, timeline capture, and RCA templates. Evaluate and land the right tools.

 

What we're looking for

 

Required

  • 7+ years in SRE, production engineering, or infrastructure operations, with clear ownership of incident management in at least one prior role.
  • Proven ability to run major incidents as the commander: keeping a room calm, driving decisions, and managing comms to customers and executives in parallel.
  • Strong technical fluency across Linux, Kubernetes, networking (L2/L3, BGP basics), and cloud or bare-metal infrastructure.
  • Deep enough in the stack to challenge assumptions and debug alongside engineers on a bridge call, and senior enough to turn each incident into durable improvements such as runbooks, automation, and guardrails, so the next incident does not need you.
  • Excellent written communication. You can draft a customer-facing RCA that is accurate, clear, and does not over- or under-promise, and do it within the SLA window.
  • Experience owning SLOs, error budgets, or equivalent reliability metrics as a working practice, not a slide.
  • Experience designing on-call rotations and alerting practices across multiple teams.
  • Comfortable operating in a fast-moving, still-forming environment. You write the runbook that did not exist yesterday.

     

Strongly preferred

  • Experience in GPU infrastructure, HPC, or AI/ML platform operations (NCCL, InfiniBand, DCGM, or similar).
  • Direct experience with Kubernetes incident response: control plane, CNI, operator patterns, noisy-neighbor debugging.
  • Experience interfacing with neocloud, enterprise, or hyperscaler customers during live incidents.
  • Familiarity with Prometheus, Checkmk, Grafana, and modern incident tooling (Opsgenie, Jira).
  • Prior experience at a neocloud, CSP, or infrastructure vendor where downtime had direct revenue impact.

 

How we work

  • We're remote-first across US, LATAM, and EU time zones with a strong operational culture.
  • We use AI tooling heavily in day-to-day operations and expect everyone on the team to be fluent with it and to help us get more leverage from it.
  • We ship, we debug, we iterate. We don't process-engineer our way around problems that need to be solved.
  • We value written artifacts (RCAs, runbooks, design docs) over meetings. We value calm over performative urgency.
  • We protect focus time and we protect sleep.

 

AI as a force multiplier

Our client runs the infrastructure that powers the AI economy. We expect our team to use AI aggressively in how we work. For this role specifically, that means using AI to draft RCAs faster, triage logs and metrics during incidents, summarize timelines, generate customer-facing communications from raw incident data, and turn post-mortems into runbook updates automatically. If you have not yet built this into your workflow, you will here.

 

What success looks like

 

First 30 days

  • Shadowed on-call rotations across the platform, network, and ops teams.
  • Read every RCA written and identified the top three recurring failure classes.
  • Met with the top three customers and understood their incident expectations and SLA terms.
  • Documented the current-state incident response process, gaps and all.

 

First 90 days

  • Published a revised severity matrix, escalation tree, and customer-comms cadence, adopted across teams.
  • Run at least two major incidents as commander with clean timelines and customer comms.
  • Delivered a new RCA template and coached the team through at least three post-mortems against it.
  • Stood up baseline reliability metrics (MTTA, MTTR, incident count by class) with a weekly reporting cadence.
  • Killed at least 20% of current alert volume by eliminating low-signal pages.

 

First 12 months

  • Our client has a mature, documented incident management program that a new customer can diligence with confidence.
  • Repeat-offender incident classes are down materially, with clear attribution to runbook, tooling, or engineering changes you drove.
  • Customer-facing RCAs are consistently delivered within SLA and are referenced by customers as a differentiator.
  • On-call health (page volume, rotation fairness, engineer-reported burden) has measurably improved.
  • Game days are a regular cadence and have surfaced and closed real gaps before they became customer incidents.

Required skills experience

Kubernetes 5 years
Linux 4 years
Incident Management 3 years
Product Operations 5 years
Networking: TCP/IP 3 years
Root Cause Analysis 3 years

Required languages

English C1 - Advanced
Published 8 June