Principal Production Engineer

Lumnix

Description

About the Client

Our client is a neocloud purpose-built for AI workloads. We design, deploy, and operate GPU clusters at scale, bare-metal compute, high-performance networking (NVLink, InfiniBand, RoCE), and the software platforms that make them usable for training and inference customers. We move fast, we own our infrastructure end-to-end, and we are building the operational foundation for the next generation of AI compute.

The Role

We are hiring a Principal Production Operations Engineer to serve as part of the senior technical backbone of how we run production. This is a generalist role. You will operate horizontally across systems administration, DevOps, SRE, automation, and platform engineering, and vertically from hands-on incident command to multi-quarter architectural strategy and execution.

This is not a people-manager role. It is a senior individual contributor role with org-wide technical influence. You will help set the standards that the rest of the operations org follows, lead the hardest incidents, design the automation and platform investments that determine how we scale, and raise the bar for every engineer around you. We expect you to have already done this work somewhere else: owned production at scale, survived the outages, built the systems that prevented the next ones, and mentored the engineers who now run them.

You will work alongside our network architects, platform engineers, and datacenter delivery team. You will be expected to have strong opinions, defend them with data, and change your mind when the evidence says you should.

What you'll own

Production infrastructure and reliability

End-to-end ownership of reliability, performance, and operational health across the company's production infrastructure: bare-metal GPU clusters, OpenStack, Kubernetes, storage, and the networking and security layers underneath.
Technical leadership during major incidents: incident command, root cause analysis, and the durable fixes that ensure the same class of failure does not recur.
Architectural decisions for how we scale compute, storage, and networking across multiple datacenters and cluster generations.
Lifecycle workflows for bare-metal provisioning, node onboarding, decommissioning, firmware management, and fleet-wide remediation.

Capacity planning, performance engineering, and the operational readiness reviews that gate new clusters into production.

Automation and platform engineering

Set the bar for infrastructure-as-code, configuration management, and operational tooling across the org. Define the patterns; do not just follow them.

Design and build the automation that eliminates classes of toil: deployment, scaling, failover, provisioning, firmware, observability, and security posture.

Own the CI/CD, GitOps, and release engineering primitives that production runs on. Make them boring, reliable, and self-service.
Choose the right tools and frameworks (Terraform, Ansible, Helm, Python, Go, Bash) for the problem in front of us, and know when to build versus buy versus avoid.
Drive the AI-assisted operations strategy: agentic runbooks, MCP-integrated tooling, and the workflows that let a small team operate a very large fleet.

Observability and operational discipline

Evolve the observability stack (OpenTelemetry, Prometheus, Grafana, Alertmanager, Loki, Checkmk) into a platform engineers trust.
Define what good looks like for instrumentation, SLOs, alerting, and on-call hygiene. Drive the org toward signal over noise.

Lead post-incident reviews. Translate findings into concrete engineering work, not action items that die in a doc.
Participate in the on-call rotation.

Technical leadership and force multiplication

Mentor senior and mid-level engineers. Raise the technical ceiling of the team without becoming the bottleneck.

Write the design docs, the runbooks, and the standards that the rest of the org builds on.
Prioritize engineering effort against real operational risk. Say no to the work that does not matter; say yes loudly to the work that does.
Partner with leadership on org-level technical strategy.
Represent operations in customer-facing technical conversations when the situation calls for it.

What we're looking for

Required

10+ years in Operations, DevOps, SRE, Production Engineering, or Infrastructure Engineering, with significant time at the senior or staff level.

Deep Linux internals knowledge.
Proven track record owning production systems end-to-end at scale: not just participating in operations, but defining how operations work.

Strong programming and automation skills in at least one of Python, Go, or similar. Bash fluency is assumed.
Hands-on expertise with Terraform, Ansible, or equivalent configuration management at scale.
Hands-on expertise with modern observability stacks (Prometheus, Grafana, Loki, OpenTelemetry): instrumentation, SLO definition, alert design, and reducing alert fatigue.

Networking depth well beyond fundamentals: routing, NAT, conntrack behavior, load balancing, DNS, TLS, and the ability to troubleshoot when needed.
Experience leading incident response and post-incident review for high-severity production events.

Track record of mentoring engineers and elevating the technical standards of a team or org.
Comfortable working in ambiguity. You build structure where there is none and do not wait to be told what to do.

Strongly preferred

Production OpenStack experience, particularly Ironic, Nova, Neutron, and the operational realities of bare-metal provisioning at scale.

Kubernetes at scale, including GPU-aware workloads, operators, and the failure modes of CNI, CSI, and the control plane.
GPU compute or AI infrastructure experience: NVIDIA driver and Fabric Manager troubleshooting, NCCL, InfiniBand or NVLink fabrics, and CUDA version management across fleets.
Hardware-aware operations: firmware management, BMC/Redfish automation, BIOS configuration at fleet scale, and the operational discipline that goes with bare metal.

Startup or scale-up experience. You know what it feels like to build the second version of everything because the first one was duct tape.
Experience integrating AI-assisted tooling (Claude Code, agentic workflows, MCP servers) into production operations.

How we work

We are remote-first. We rely on written communication, async-by-default decision making, and high-bandwidth pairing when it matters.
We own our infrastructure. We do not hand off problems to vendors and wait. We diagnose, we fix, and we write down what we learned.
We use AI as a force multiplier, not as a replacement for engineering judgment, and not as a novelty. We expect every engineer to be fluent with it.

We participate in a 24/7 on-call rotation. Principal engineers are also an escalation point when the rest of the rotation needs help.

AI as a Force Multiplier

Operations is built on the assumption that a small, senior team augmented with AI can outperform a much larger traditional ops org. We expect you to use Claude Code, MCP-integrated tooling, and agentic workflows as core parts of how you investigate, automate, and document. If you have not used these tools in production yet, you should be hungry to. If you have, you should be ready to help define how the rest of the team uses them.

What success looks like

First 30 days

Onboarded into all production systems and the active incident backlog. You know where the bodies are buried.
Shadowed an incident and contributed at least one durable fix to a recurring operational issue.
Identified the top three operational risks you would prioritize in the next quarter and started building consensus on them.

First 90 days

Owning at least one major reliability or automation initiative end-to-end, with measurable impact (toil reduction, MTTR improvement, or reliability gain).
Driving meaningful improvements to incident response, observability, or platform tooling that the rest of the org has adopted.
Actively mentoring at least one mid-level or senior engineer.

First 12 months

Recognized across the company as a technical authority on production operations. Architecture decisions of any significance include you.
Shipped at least one platform-level investment (automation, observability, or lifecycle) that has materially changed how we operate.
Raised the operational maturity of the org by a visible step, measured in incident frequency, MTTR, toil, and engineer quality of life.
Influenced the hiring bar and the technical direction of operations beyond your own scope of work.

Required skills experience

Kubernetes 5 years

Grafana 5 years

SRE 5 years

Ansible 5 years

Golang 5 years

OpenStack 5 years

DevOps 5 years

Prometheus 5 years

Infrastructure 5 years

CI/CD 5 years

Network troubleshooting 5 years

Linux 7 years

Python 5 years

Required languages

English C1 - Advanced

Published 13 June

Only from 10 years of experience
Full Remote
Countries where we consider candidates: EU

Data Engineer

Employment: Full-time
Domain: SaaS
Startup
