Senior Technical Manager, Platform Engineering
About the Client
Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we're rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.
We're small, senior, and moving fast. The people who do well here own problems end-to-end and make decisions with incomplete information.
The role
We're hiring a Senior Manager, Platform Engineering to lead the team that keeps our GPU clusters running and evolving. You'll own the operational heartbeat of Company's cloud platform: the people, the systems, and the practices that turn racks of GPUs into reliable customer-facing infrastructure.
This is a hands-on leadership role. You'll start roughly 50/50 between technical work and management as you ramp up, earn context, and build trust with the team, and scale toward 80/20 management over your first 12 months as the team grows and your direct reports take on more. We're not looking for a manager who has forgotten how to read a stack trace, and we're not looking for a senior IC with a team tacked on. We're looking for someone who leads by setting technical direction, raising the bar on operational rigor, and growing engineers, and who can still jump into an incident bridge and be useful.
What you'll own
- A team of ~9 engineers spanning production engineering, SRE, security, and automation
- The reliability, performance, and operability of Client's GPU cloud platform across multiple clusters and customers
- Incident response and post-incident culture: you'll set the standard for how we investigate, communicate, and learn from outages
- Operational readiness for new clusters and data center buildouts, in close partnership with our DC Build, Networking, and Program Management functions
- Platform automation and infrastructure-as-code maturity: reducing toil, codifying tribal knowledge, and making our environment legible to both engineers and AI tooling
- Hiring, coaching, and career development for the team, including an anticipated split of the team into specialized functions as we scale
What we're looking for
Required
- 8+ years of infrastructure, production engineering, or SRE experience, including 3+ years managing or tech-leading engineers
- Deep hands-on experience with Linux production systems at scale: you've debugged kernel, networking, and storage issues in anger, not just read about them
- Strong Kubernetes operational experience: you understand what breaks in Kubernetes at scale and why, not just how to write a deployment manifest
- Experience running cloud or cloud-adjacent platforms in production (IaaS, bare-metal, or hybrid), with real customers depending on uptime
- Fluency with modern automation and IaC tooling (Ansible, Terraform, or equivalent) and a bias toward codifying operational knowledge rather than keeping it in people's heads
- Track record of building and running on-call, incident response, and post-mortem practices that engineers actually trust
- Clear, direct written communication; much of our team and work is async
Strongly preferred
- Experience operating GPU or HPC infrastructure, or a strong appetite to go deep on it quickly
- OpenStack or bare-metal provisioning experience
- Experience working alongside networking and security engineers as peers, not as tickets to file
- Familiarity with observability stacks (Prometheus, Grafana, Checkmk, or similar) and a point of view on what good monitoring looks like
- Experience scaling a team through rapid growth: splitting functions, hiring against a plan, and evolving reporting structures without breaking trust
How we work
- We're remote-first across US, LATAM, and EU time zones
- We write things down: decisions, architecture, runbooks, post-mortems
- We use AI tooling heavily in day-to-day operations and expect everyone on the team to be fluent with it and to help us get more leverage from it
- We ship, we debug, we iterate. We don't process-engineer our way around problems that need to be solved
AI as a force multiplier
We're making a deliberate bet that AI changes the shape of infrastructure teams. Our plan is to scale our platform faster than we scale headcount, using AI-powered development tools to build and refactor automation, AI-assisted testing and validation to improve reliability, and LLM-driven workflows to accelerate investigation, documentation, and operational review.
We're already applying these approaches in production and actively working to turn institutional knowledge into AI-accessible systems that improve operational efficiency and decision-making.
We want a leader who is genuinely excited about this: not someone who tolerates AI tooling because it's in the JD, but someone who will drive it. That means:
- Setting the expectation on the team that AI-assisted workflows are the default, not the exception
- Identifying where agents can own meaningful slices of work: writing playbooks, generating tests, validating configurations, triaging alerts, drafting RCAs
- Building the substrate the team needs: good documentation, clear schemas, MCP integrations, and the kind of structured knowledge that makes AI actually useful rather than a gimmick
- Measuring what's working, killing what isn't, and being honest with the team and with leadership about both
If your reaction to this is "finally, a team that's serious about this," we should talk.
What success looks like
First 30 days: You've built 1:1 relationships with every member of the team, shadowed enough incidents and customer escalations to have a real picture of where we are, and identified the two or three operational patterns most worth changing.
First 90 days: You're the owner of our incident and post-mortem practice. You've onboarded two new senior hires (SRE and Security Engineer) and set clear expectations for their first deliverables. You've made at least one meaningful call on platform architecture or operational tooling that the team agrees was the right one.
First 12 months: The team is measurably more effective: fewer repeat incidents, faster time to resolution, more work automated away. You've hired at least one additional engineer, coached at least one existing team member into a stretch role, and established Platform Engineering as a function that other teams at Company want to work with.
Required skills and experience
| Engineering Management | 5 years |
| Linux | 5 years |
| Kubernetes | 5 years |
| SRE | 5 years |
| Infrastructure | 5 years |
| Terraform | 5 years |
| Ansible | 5 years |
| Python | 5 years |
| OpenStack | 5 years |
| Prometheus | 5 years |
| Incident Management | 5 years |
Required languages
| English | C1 - Advanced |