Lumnix

Senior Technical Manager Platform Engineering


About the Client

Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we're rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.

We're small, senior, and moving fast. The people who do well here own problems end-to-end and make decisions with incomplete information.
 

The role

We're hiring a Senior Manager, Platform Engineering to lead the team that keeps our GPU clusters running and evolving. You'll own the operational heartbeat of Company's cloud platform: the people, the systems, and the practices that turn racks of GPUs into reliable customer-facing infrastructure.

This is a hands-on leadership role. You'll start roughly 50/50 between technical work and management as you ramp up, earn context, and build trust with the team, and scale toward 80/20 management over your first 12 months as the team grows and your direct reports take on more. We're not looking for a manager who has forgotten how to read a stack trace, and we're not looking for a senior IC with a team tacked on. We're looking for someone who leads by setting technical direction, raising the bar on operational rigor, and growing engineers, and who can still jump into an incident bridge and be useful.
 

What you'll own

  • A team of ~9 engineers spanning production engineering, SRE, security, and automation
  • The reliability, performance, and operability of Client's GPU cloud platform across multiple clusters and customers
  • Incident response and post-incident culture: you'll set the standard for how we investigate, communicate, and learn from outages
  • Operational readiness for new clusters and data center buildouts, in close partnership with our DC Build, Networking, and Program Management functions
  • Platform automation and infrastructure-as-code maturity: reducing toil, codifying tribal knowledge, and making our environment legible to both engineers and AI tooling
  • Hiring, coaching, and career development for the team, including an anticipated split of the team into specialized functions as we scale
     

What we're looking for

Required

  • 8+ years of infrastructure, production engineering, or SRE experience, including 3+ years managing or tech-leading engineers
  • Deep hands-on experience with Linux production systems at scale: you've debugged kernel, networking, and storage issues in anger, not just read about them
  • Strong Kubernetes operational experience: you understand what breaks in Kubernetes at scale and why, not just how to write a deployment manifest
  • Experience running cloud or cloud-adjacent platforms in production (IaaS, bare-metal, or hybrid), with real customers depending on uptime
  • Fluency with modern automation and IaC tooling (Ansible, Terraform, or equivalent) and a bias toward codifying operational knowledge rather than keeping it in people's heads
  • Track record of building and running on-call, incident response, and post-mortem practices that engineers actually trust
  • Clear, direct written communication: much of our team and work is async

     

Strongly preferred

  • Experience operating GPU or HPC infrastructure, or a strong appetite to go deep on it quickly
  • OpenStack or bare-metal provisioning experience
  • Experience working alongside networking and security engineers as peers, not as tickets to file
  • Familiarity with observability stacks (Prometheus, Grafana, Checkmk, or similar) and a point of view on what good monitoring looks like
  • Experience scaling a team through rapid growth: splitting functions, hiring against a plan, and evolving reporting structures without breaking trust
     

How we work

  • We're remote-first across US, LATAM, and EU time zones
  • We write things down: decisions, architecture, runbooks, post-mortems
  • We use AI tooling heavily in day-to-day operations and expect everyone on the team to be fluent with it and to help us get more leverage from it
  • We ship, we debug, we iterate. We don't process-engineer our way around problems that need to be solved
     

AI as a force multiplier

We're making a deliberate bet that AI changes the shape of infrastructure teams. Our plan is to scale our platform faster than we scale headcount, using AI-powered development tools to build and refactor automation, AI-assisted testing and validation to improve reliability, and LLM-driven workflows to accelerate investigation, documentation, and operational review.

We're already applying these approaches in production and actively working to turn institutional knowledge into AI-accessible systems that improve operational efficiency and decision-making.
We want a leader who is genuinely excited about this, not someone who tolerates AI tooling because it's in the JD, but someone who will drive it. That means:

  • Setting the expectation on the team that AI-assisted workflows are the default, not the exception
  • Identifying where agents can own meaningful slices of work: writing playbooks, generating tests, validating configurations, triaging alerts, drafting RCAs
  • Building the substrate the team needs: good documentation, clear schemas, MCP integrations, and the kind of structured knowledge that makes AI actually useful rather than a gimmick
  • Measuring what's working, killing what isn't, and being honest with the team and with leadership about both

If your reaction to this is "finally, a team that's serious about this," we should talk.
 

What success looks like

 

First 30 days: You've built 1:1 relationships with every member of the team, shadowed enough incidents and customer escalations to have a real picture of where we are, and identified the two or three operational patterns most worth changing.

First 90 days: You're the owner of our incident and post-mortem practice. You've onboarded two new senior hires (SRE and Security Engineer) and set clear expectations for their first deliverables. You've made at least one meaningful call on platform architecture or operational tooling that the team agrees was the right one.

First 12 months: The team is measurably more effective: fewer repeat incidents, faster time to resolution, more work automated away. You've hired at least one additional engineer, coached at least one existing team member into a stretch role, and established Platform Engineering as a function that other teams at Company want to work with.

Required skills experience

Engineering Management: 5 years
Linux: 5 years
Kubernetes: 5 years
SRE: 5 years
Infrastructure: 5 years
Terraform: 5 years
Ansible: 5 years
Python: 5 years
OpenStack: 5 years
Prometheus: 5 years
Incident Management: 5 years

Required languages

English C1 - Advanced
Published 13 June