Principal Production Engineer
Description
About the Client
Our client is a neocloud purpose-built for AI workloads. We design, deploy, and operate GPU clusters at scale, bare-metal compute, high-performance networking (NVLink, InfiniBand, RoCE), and the software platforms that make them usable for training and inference customers. We move fast, we own our infrastructure end-to-end, and we are building the operational foundation for the next generation of AI compute.
The Role
We are hiring a Principal Production Operations Engineer to serve as part of the senior technical backbone of how we run production. This is a generalist role. You will operate horizontally across systems administration, DevOps, SRE, automation, and platform engineering, and vertically from hands-on incident command to multi-quarter architectural strategy and execution.
This is not a people-manager role. It is a senior individual contributor role with org-wide technical influence. You will help set the standards that the rest of the operations org follows, lead the hardest incidents, design the automation and platform investments that determine how we scale, and raise the bar for every engineer around you. We expect you to have already done this work somewhere else: owned production at scale, survived the outages, built the systems that prevented the next ones, and mentored the engineers who now run them.
You will work alongside our network architects, platform engineers, and datacenter delivery team. You will be expected to have strong opinions, defend them with data, and change your mind when the evidence says you should.
What you'll own
Production infrastructure and reliability
- End-to-end ownership of reliability, performance, and operational health across the company's production infrastructure: bare-metal GPU clusters, OpenStack, Kubernetes, storage, and the networking and security layers underneath.
- Technical leadership during major incidents: incident command, root cause analysis, and the durable fixes that ensure the same class of failure does not recur.
- Architectural decisions for how we scale compute, storage, and networking across multiple datacenters and cluster generations.
- Lifecycle workflows for bare-metal provisioning, node onboarding, decommissioning, firmware management, and fleet-wide remediation.
- Capacity planning, performance engineering, and the operational readiness reviews that gate new clusters into production.
Automation and platform engineering
- Set the bar for infrastructure-as-code, configuration management, and operational tooling across the org. Define the patterns; do not just follow them.
- Design and build the automation that eliminates classes of toil: deployment, scaling, failover, provisioning, firmware, observability, and security posture.
- Own the CI/CD, GitOps, and release engineering primitives that production runs on. Make them boring, reliable, and self-service.
- Choose the right tools and frameworks (Terraform, Ansible, Helm, Python, Go, Bash) for the problem in front of us, and know when to build versus buy versus avoid.
- Drive the AI-assisted operations strategy: agentic runbooks, MCP-integrated tooling, and the workflows that let a small team operate a very large fleet.
Observability and operational discipline
- Evolve the observability stack (OpenTelemetry, Prometheus, Grafana, Alertmanager, Loki, Checkmk) into a platform that engineers trust.
- Define what good looks like for instrumentation, SLOs, alerting, and on-call hygiene. Drive the org toward signal over noise.
- Lead post-incident reviews. Translate findings into concrete engineering work, not action items that die in a doc.
- Participate in the on-call rotation.
Technical leadership and force multiplication
- Mentor senior and mid-level engineers. Raise the technical ceiling of the team without becoming the bottleneck.
- Write the design docs, the runbooks, and the standards that the rest of the org builds on.
- Prioritize engineering effort against real operational risk. Say no to the work that does not matter; say yes loudly to the work that does.
- Partner with leadership on org-level technical strategy.
- Represent operations in customer-facing technical conversations when the situation calls for it.
What we're looking for
Required
- 10+ years in Operations, DevOps, SRE, Production Engineering, or Infrastructure Engineering, with significant time at the senior or staff level.
- Deep Linux internals knowledge.
- Proven track record of owning production systems end-to-end at scale: not just participating in operations, but defining how operations work.
- Strong programming and automation skills in at least one language such as Python or Go. Bash fluency is assumed.
- Hands-on expertise with Terraform, Ansible, or equivalent configuration management at scale.
- Hands-on expertise with modern observability stacks (Prometheus, Grafana, Loki, OpenTelemetry): instrumentation, SLO definition, alert design, and reducing alert fatigue.
- Networking depth well beyond fundamentals: routing, NAT, conntrack behavior, load balancing, DNS, and TLS, and the ability to troubleshoot them when needed.
- Experience leading incident response and post-incident review for high-severity production events.
- Track record of mentoring engineers and elevating the technical standards of a team or org.
- Comfortable working in ambiguity. You build structure where there is none and do not wait to be told what to do.
Strongly preferred
- Production OpenStack experience, particularly Ironic, Nova, Neutron, and the operational realities of bare-metal provisioning at scale.
- Kubernetes at scale, including GPU-aware workloads, operators, and the failure modes of CNI, CSI, and the control plane.
- GPU compute or AI infrastructure experience: NVIDIA driver and Fabric Manager troubleshooting, NCCL, InfiniBand or NVLink fabrics, and CUDA version management across fleets.
- Hardware-aware operations: firmware management, BMC/Redfish automation, BIOS configuration at fleet scale, and the operational discipline that goes with bare metal.
- Startup or scale-up experience. You know what it feels like to build the second version of everything because the first one was duct tape.
- Experience integrating AI-assisted tooling (Claude Code, agentic workflows, MCP servers) into production operations.
How we work
- We are remote-first. We rely on written communication, async-by-default decision-making, and high-bandwidth pairing when it matters.
- We own our infrastructure. We do not hand off problems to vendors and wait. We diagnose, we fix, and we write down what we learned.
- We use AI as a force multiplier, not as a replacement for engineering judgment, and not as a novelty. We expect every engineer to be fluent with it.
- We participate in a 24/7 on-call rotation. Principal engineers are also an escalation point when the rest of the rotation needs help.
AI as a Force Multiplier
Operations is built on the assumption that a small, senior team augmented with AI can outperform a much larger traditional ops org. We expect you to use Claude Code, MCP-integrated tooling, and agentic workflows as core parts of how you investigate, automate, and document. If you have not used these tools in production yet, you should be hungry to. If you have, you should be ready to help define how the rest of the team uses them.
What success looks like
First 30 days
- Onboarded into all production systems and the active incident backlog. You know where the bodies are buried.
- Shadowed an incident and contributed at least one durable fix to a recurring operational issue.
- Identified the top three operational risks you would prioritize in the next quarter and started building consensus on them.
First 90 days
- Owning at least one major reliability or automation initiative end-to-end, with measurable impact (toil reduction, MTTR improvement, or reliability gain).
- Driving meaningful improvements to incident response, observability, or platform tooling that the rest of the org has adopted.
- Actively mentoring at least one mid-level or senior engineer.
First 12 months
- Recognized across the company as a technical authority on production operations. Architecture decisions of any significance include you.
- Shipped at least one platform-level investment (automation, observability, or lifecycle) that has materially changed how we operate.
- Raised the operational maturity of the org by a visible step, measured in incident frequency, MTTR, toil, and engineer quality of life.
- Influenced the hiring bar and the technical direction of operations beyond your own scope of work.
Required skills and experience
| Kubernetes | 5 years |
| Grafana | 5 years |
| SRE | 5 years |
| Ansible | 5 years |
| Python | 5 years |
| Golang | 5 years |
| OpenStack | 5 years |
| DevOps | 5 years |
| Prometheus | 5 years |
| Infrastructure | 5 years |
| CI/CD | 5 years |
| Network troubleshooting | 5 years |
| Linux | 7 years |
Required languages
| English | C1 - Advanced |