Senior Technical Manager, Platform Engineering

Lumnix 🔥

About the Client

Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we're rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.

We're small, senior, and moving fast. The people who do well here own problems end-to-end and make decisions with incomplete information.

The role

We're hiring a Senior Manager, Platform Engineering to lead the team that keeps our GPU clusters running and evolving. You'll own the operational heartbeat of the company's cloud platform: the people, the systems, and the practices that turn racks of GPUs into reliable customer-facing infrastructure.

This is a hands-on leadership role. You'll start roughly 50/50 between technical work and management as you ramp up, earn context, and build trust with the team, and scale toward 80/20 management over your first 12 months as the team grows and your direct reports take on more. We're not looking for a manager who has forgotten how to read a stack trace, and we're not looking for a senior IC with a team tacked on. We're looking for someone who leads by setting technical direction, raising the bar on operational rigor, and growing engineers, and who can still jump into an incident bridge and be useful.

What you'll own

A team of ~9 engineers spanning production engineering, SRE, security, and automation
The reliability, performance, and operability of the client's GPU cloud platform across multiple clusters and customers
Incident response and post-incident culture: you'll set the standard for how we investigate, communicate, and learn from outages
Operational readiness for new clusters and data center buildouts, in close partnership with our DC Build, Networking, and Program Management functions
Platform automation and infrastructure-as-code maturity: reducing toil, codifying tribal knowledge, and making our environment legible to both engineers and AI tooling
Hiring, coaching, and career development for the team, including an anticipated split of the team into specialized functions as we scale

What we're looking for

Required

8+ years of infrastructure, production engineering, or SRE experience, including 3+ years managing or tech-leading engineers
Deep hands-on experience with Linux production systems at scale: you've debugged kernel, networking, and storage issues in anger, not just read about them
Strong Kubernetes operational experience: you understand what breaks in Kubernetes at scale and why, not just how to write a deployment manifest
Experience running cloud or cloud-adjacent platforms in production (IaaS, bare-metal, or hybrid), with real customers depending on uptime
Fluency with modern automation and IaC tooling (Ansible, Terraform, or equivalent) and a bias toward codifying operational knowledge rather than keeping it in people's heads
Track record of building and running on-call, incident response, and post-mortem practices that engineers actually trust
Clear, direct written communication: much of our team and work is async

Strongly preferred

Experience operating GPU or HPC infrastructure, or a strong appetite to go deep on it quickly
OpenStack or bare-metal provisioning experience
Experience working alongside networking and security engineers as peers, not as tickets to file
Familiarity with observability stacks (Prometheus, Grafana, Checkmk, or similar) and a point of view on what good monitoring looks like
Experience scaling a team through rapid growth: splitting functions, hiring against a plan, and evolving reporting structures without breaking trust

How we work

We're remote-first across US, LATAM, and EU time zones
We write things down: decisions, architecture, runbooks, post-mortems
We use AI tooling heavily in day-to-day operations and expect everyone on the team to be fluent with it and to help us get more leverage from it
We ship, we debug, we iterate. We don't process-engineer our way around problems that need to be solved

AI as a force multiplier

We're making a deliberate bet that AI changes the shape of infrastructure teams. Our plan is to scale our platform faster than we scale headcount, using AI-powered development tools to build and refactor automation, AI-assisted testing and validation to improve reliability, and LLM-driven workflows to accelerate investigation, documentation, and operational review.

We're already applying these approaches in production and actively working to turn institutional knowledge into AI-accessible systems that improve operational efficiency and decision-making.

We want a leader who is genuinely excited about this: not someone who tolerates AI tooling because it's in the JD, but someone who will drive it. That means:

Setting the expectation on the team that AI-assisted workflows are the default, not the exception
Identifying where agents can own meaningful slices of work: writing playbooks, generating tests, validating configurations, triaging alerts, drafting RCAs
Building the substrate the team needs: good documentation, clear schemas, MCP integrations, and the kind of structured knowledge that makes AI actually useful rather than a gimmick
Measuring what's working, killing what isn't, and being honest with the team and with leadership about both

If your reaction to this is "finally, a team that's serious about this," we should talk.

What success looks like

First 30 days: You've built 1:1 relationships with every member of the team, shadowed enough incidents and customer escalations to have a real picture of where we are, and identified the two or three operational patterns most worth changing.

First 90 days: You're the owner of our incident and post-mortem practice. You've onboarded two new senior hires (SRE and Security Engineer) and set clear expectations for their first deliverables. You've made at least one meaningful call on platform architecture or operational tooling that the team agrees was the right one.

First 12 months: The team is measurably more effective: fewer repeat incidents, faster time to resolution, more work automated away. You've hired at least one additional engineer, coached at least one existing team member into a stretch role, and established Platform Engineering as a function that other teams at the company want to work with.

Required skills experience

Engineering Management 5 years

Linux 5 years

Kubernetes 5 years

SRE 5 years

Infrastructure 5 years

Terraform 5 years

Ansible 5 years

OpenStack 5 years

Prometheus 5 years

Incident Management 5 years

Python 5 years

Required languages

English C1 - Advanced

Published 13 June · Updated 14 July

Only from 8 years of experience
Full Remote
Worldwide

Engineering Manager

Employment: Full-time
Domain: SaaS
Mixed