Principal Infrastructure Engineer, Cloud Platform

Full Remote · Worldwide · 10 years of experience · English - C1 · Machine Learning / Big Data


About client

Client builds and operates GPU cloud infrastructure at the scale modern AI demands. We run dense accelerated-compute clusters on bare metal, powered by NVIDIA and AMD accelerators and K8s, stitched together with high-performance Ethernet and InfiniBand fabrics, and we deliver them to customers as reliable, secure, high-throughput platforms. We are scaling rapidly across new data center sites, and the infrastructure team sits at the center of that growth, building the systems that turn racks of hardware into a product customers depend on.

The role

We are hiring a Principal Infrastructure Engineer to build the foundation of our greenfield v2 cloud platform: the bare-metal provisioning and control-plane substrate that turns new racks of GPU hardware into reliable, secure, customer-ready capacity. This is a build-focused, hands-on role and the most senior individual contributor on the platform build. You will own the recommendation and the call on the infrastructure layer (we’re currently in POC across different vendors) and the build-vs-buy decision that goes with it, with our Cloud Platform Architect pressure-testing it. You will drive integration across the other domains and define how infrastructure is templated, multi-tenanted, and handed to customers. You will define the architecture and the standards the rest of the team builds to, while still being deep in the code yourself. Day-to-day operations are not your focus; getting the platform designed and built right is.

You will work at the center of new cluster and new-site buildouts, partnering closely with our network, SRE, and security engineers. They own the fabric and day-to-day operations; you own getting the platform built right. Because the systems you design become the foundation customer GPU workloads run on, getting the architecture and automation right the first time matters more here than almost anywhere.

What you’ll own

• Platform architecture and technical direction: own the architecture of the cloud platform control plane and orchestration layer for new GPU clusters, making the foundational decisions and setting the technical standards the rest of the platform is built on.

• Bare-metal provisioning automation: build the automated pipeline that takes a node from rack-and-stack handoff to a fully configured, customer-ready state, repeatably across sites and at scale. That means real acceptance gates before a node is marked customer-ready (firmware to the qualified recipe, GPU burn-in, NCCL/RCCL runs, and fabric checks), plus a golden-image build-and-qualify pipeline in CI.

• Infrastructure as code and automation: treat the platform as code from day one. Build the Terraform and Ansible that provision, configure, and validate the platform, so the build is reproducible rather than hand-assembled.

• Control-plane substrate: build the control plane that turns provisioned bare metal into cluster-ready capacity, and own the clean handoff to the Kubernetes layer. You own tenant isolation at the network and hardware layer; the Kubernetes and namespace isolation model belongs to the Principal Kubernetes Platform Engineer.

• DPU and SmartNIC integration: design how we leverage BlueField DPUs in the platform, covering offload, provisioning, isolation, and the SuperNIC/DPU-mode configurations that let us get the most out of the hardware.

• Platform reliability by design: build observability, guardrails, and progressive-rollout mechanisms into the platform from the start.

• Security built in: instrument the platform as you build it and design to our SOC 2, ISO 27001, and HIPAA obligations. Use a Vault-based secrets model from the start. Handle BIOS and system prep for trusted domain execution, secure boot, and backend-fabric hardening.

• Confidential-compute-ready platform (growth area): Design the tenancy and scheduling so it fits our confidential-compute roadmap: whole-GPU per tenant, and TDX with GPU passthrough. Deep CC background is a plus, not a requirement.

• New site buildouts: take a critical role in standing up the platform at new data center sites, turning green-field facilities into production capacity.

• Technical leadership: set patterns, review designs, and raise the engineering bar across the infrastructure team, multiplying the impact of other engineers and mentoring senior ICs without becoming a people manager.

What we’re looking for

Required

• 10+ years building production infrastructure at scale, including deep green-field and platform-build experience, for cloud, hosting, or high-performance compute environments.

• A track record of setting technical direction on complex platforms: architecture others adopt, standards others follow, and influence that extends well beyond your own keyboard.

• Strong Linux systems fundamentals: you can reason about the kernel, networking stack, storage, and performance, not just the command line.

• Fluency in infrastructure as code and automation: Terraform, Ansible, and a real programming language used to build durable platform tooling, not just one-off scripts.

• Production Kubernetes experience: building, deploying, and operating clusters and the platform services that run on them.

• A demonstrated bias toward building things right the first time: sound architecture, reproducible automation, and durable design over short-term shortcuts.

• Strong written communication: you document designs and decisions clearly and leave systems more understandable than you found them.

Strongly preferred

• DPU / SmartNIC experience, especially leveraging offload, DPU/SuperNIC modes, and platform designs that exploit them. This is a meaningful differentiator for us.

• Experience with GPU and accelerated-compute infrastructure across NVIDIA and AMD, including ROCm and firmware lifecycle, or HPC/AI infrastructure.

• Bare-metal provisioning at scale (Redfish/BMC automation, NetBox or a similar source of truth).

• Experience standing up cloud-platform control-plane stacks in a green-field setting.

• Networking depth: BGP, VLANs, fabric concepts, and the ability to debug connectivity and throughput issues. You won’t own the network, but the more fluently you can partner with our network engineers, the better.

• Familiarity with high-performance fabrics: RoCE Ethernet / InfiniBand.

• HashiCorp Vault, and building within a SOC 2 / ISO 27001 / HIPAA compliance posture.

How we work

• Small, senior, high-trust teams. We hire people who can own and solve problems end to end and we give them the room to do it.

• Code over clicks. The platform is defined, reviewed, and versioned. Manual changes are the exception and they get automated away.

• Build it right. We invest in sound architecture and durable design so the platform scales cleanly instead of accumulating debt.

• Customers are real. The platform exists to serve customer workloads, and we build to the reliability and white-glove standard that implies.

• We document. Designs, decisions, and architecture are written down so the team compounds knowledge instead of re-learning it.

AI as a force multiplier

We expect our engineers to use AI to move faster and operate at higher leverage, and we build our environment to make that the default rather than an afterthought. We maintain CLAUDE.md and AGENTS.md context files across our repositories, run MCP servers against our operational systems (Jira, NetBox, Checkmk), and use agentic coding tools such as Claude Code directly in our infrastructure and platform workflows.

You don’t need to be an AI expert, but you should be eager to fold these tools into how you design, build, and automate — and to help us push the frontier of what an AI-leveraged infrastructure team can do.

What success looks like

First 30 days

• Ramped on our environment (the platform stack, provisioning approach, and target architecture) and able to navigate the codebase and design docs independently.

• Shipped your first meaningful contribution to the platform build with the team’s review.

• Aligned with network, SRE, and security on where your work meets theirs.

First 90 days

• Owning a defined component of the platform build end to end, from design through delivery.

• Delivered a meaningful piece of provisioning or platform automation that made the build faster, more reproducible, or more reliable.

• Shaped at least one foundational architecture decision that the platform is built on.

First 12 months

• The recognized technical authority for the cloud platform build, setting direction and trusted across network, SRE, security, and platform engineering.

• Designed and delivered a significant piece of the cloud platform or its automation that raised the team’s leverage and the platform’s reliability.

• Took a lead role in bringing at least one new data center site online.


43 views · 2 applications · 7d

Principal Data Engineer and Architect

Lumnix 🔥

$$$$

Full Remote · Worldwide · 8 years of experience · English - C1 · Fintech


Description

Our client is looking for a hands-on Principal Data Engineer & Architect to lead the design, evolution, and execution of their core data engineering ecosystem. This role is built specifically for an elite technical leader who loves mapping out highly reliable distributed architectures but is driven by a strong desire to stay close to the metal—allocating approximately 70% of their time to writing mission-critical production code, core data libraries, and internal frameworks, and 30% to high-level architectural design, data infrastructure governance, and technical mentorship.

About Our Client

Our client is a venture-backed, AI-powered platform built specifically for consumer brands. It connects seamlessly to fragmented external sources like banks, POS systems, and advertising platforms to deliver daily profit & loss snapshots, cash flow plans, and peer benchmarking—all in real time. With integrations stretching across finance, e-commerce, and marketing tools, the platform provides the clarity and automation brands need to grow confidently.

What You’ll Do

Architect Data Infrastructure (70% Focus): Maintain a heavy, hands-on presence in the codebase. Author core data processing libraries, distributed computing frameworks, and high-scale data platforms in JavaScript (Node.js/NestJS) and Python to establish gold standards for the engineering team.

Design Data Systems (30% Focus): Design and evolve a robust, multi-tenant data architecture capable of handling massive datasets and high-throughput processing across hundreds of distinct data domains concurrently.

Data Quality & Governance: Establish bulletproof data integrity and observability patterns. Focus on robust processing workflows, comprehensive data validation, automated quality checks, and end-to-end distributed tracing across the data lifecycle.

Scale Data Pipelines: Define and enforce efficient data architectures, high-throughput streaming patterns, and transaction boundaries to guarantee accurate real-time financial reporting at scale.

Technical Leadership & Mentorship: Act as an elite technical sounding board for the engineering organization. Conduct comprehensive architecture reviews, upskill senior engineers, and champion modern DevOps, testing, and deployment workflows.

What We’re Looking For

Deep Industry Experience: 8+ years of professional software engineering experience, with a proven history of designing, deploying, and maintaining large-scale data processing and distributed data infrastructure solutions.

Polyglot Mastery in Data Engineering: Highly proficient in both JavaScript/TypeScript (Node.js, NestJS) and Python, with the ability to build sophisticated data pipelines and distributed systems using each language’s native strengths.

Advanced Data Engineering & Architecture Expertise: Extensively experienced with distributed systems, high-scale storage solutions, query optimization, and complex data architecture patterns for massive, high-velocity datasets.

Data Modeling & Schema Excellence: Expert-level understanding of relational (SQL) and non-relational (NoSQL) databases, schema design, complex data validation, mapping workflows, and ETL pipeline design.

True Builder Mentality: A strong architectural thinker who actively rejects the “ivory-tower” approach and remains deeply passionate about shipping high-quality production code every single week.

Would Be a Plus

Direct technical experience within FinTech, e-commerce SaaS platforms, or real-time data streaming architectures (e.g., Kafka, RabbitMQ).

Familiarity with financial data aggregates or banking networks (e.g., Plaid, Yodlee, Stripe).

Experience setting up data ingestion pipelines optimized for downstream AI/LLM analysis models.

What We Offer

Highly competitive principal-level base salary + substantial early-stage equity package.

Comprehensive medical, dental, and vision insurance coverage.

Flexible, remote-first working culture with an elite engineering group.

Direct technical influence over the foundational platform architecture of a fast-growing, venture-backed startup.


74 views · 14 applications · 11d

Senior Technical Manager Platform Engineering

Lumnix 🔥

$$$$

Full Remote · Worldwide · 8 years of experience · English - C1 · SaaS


About the Client

Our client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we're rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.

We're small, senior, and moving fast. The people who do well here own problems end-to-end and make decisions with incomplete information.

The role

We're hiring a Senior Manager, Platform Engineering to lead the team that keeps our GPU clusters running and evolving. You'll own the operational heartbeat of Company's cloud platform: the people, the systems, and the practices that turn racks of GPUs into reliable customer-facing infrastructure.

This is a hands-on leadership role. You'll start roughly 50/50 between technical work and management as you ramp up, earn context, and build trust with the team, and scale toward 80/20 management over your first 12 months as the team grows and your direct reports take on more. We're not looking for a manager who has forgotten how to read a stack trace, and we're not looking for a senior IC with a team tacked on. We're looking for someone who leads by setting technical direction, raising the bar on operational rigor, and growing engineers, and who can still jump into an incident bridge and be useful.

What you'll own

A team of ~9 engineers spanning production engineering, SRE, security, and automation
The reliability, performance, and operability of Client's GPU cloud platform across multiple clusters and customers
Incident response and post-incident culture: you'll set the standard for how we investigate, communicate, and learn from outages
Operational readiness for new clusters and data center buildouts, in close partnership with our DC Build, Networking, and Program Management functions
Platform automation and infrastructure-as-code maturity: reducing toil, codifying tribal knowledge, and making our environment legible to both engineers and AI tooling
Hiring, coaching, and career development for the team, including an anticipated split of the team into specialized functions as we scale

What we're looking for

Required

8+ years of infrastructure, production engineering, or SRE experience, including 3+ years managing or tech-leading engineers
Deep hands-on experience with Linux production systems at scale: you've debugged kernel, networking, and storage issues in anger, not just read about them
Strong Kubernetes operational experience: you understand what breaks in Kubernetes at scale and why, not just how to write a deployment manifest
Experience running cloud or cloud-adjacent platforms in production (IaaS, bare-metal, or hybrid), with real customers depending on uptime
Fluency with modern automation and IaC tooling (Ansible, Terraform, or equivalent) and a bias toward codifying operational knowledge rather than keeping it in people's heads
Track record of building and running on-call, incident response, and post-mortem practices that engineers actually trust
Clear, direct written communication: much of our team and work is async

Strongly preferred

Experience operating GPU or HPC infrastructure, or a strong appetite to go deep on it quickly
OpenStack or bare-metal provisioning experience
Experience working alongside networking and security engineers as peers, not as tickets to file
Familiarity with observability stacks (Prometheus, Grafana, Checkmk, or similar) and a point of view on what good monitoring looks like
Experience scaling a team through rapid growth: splitting functions, hiring against a plan, and evolving reporting structures without breaking trust

How we work

We're remote-first across US, LATAM, and EU time zones
We write things down: decisions, architecture, runbooks, post-mortems
We use AI tooling heavily in day-to-day operations and expect everyone on the team to be fluent with it and to help us get more leverage from it
We ship, we debug, we iterate. We don't process-engineer our way around problems that need to be solved

AI as a force multiplier

We're making a deliberate bet that AI changes the shape of infrastructure teams. Our plan is to scale our platform faster than we scale headcount, using AI-powered development tools to build and refactor automation, AI-assisted testing and validation to improve reliability, and LLM-driven workflows to accelerate investigation, documentation, and operational review.

We're already applying these approaches in production and actively working to turn institutional knowledge into AI-accessible systems that improve operational efficiency and decision-making.
We want a leader who is genuinely excited about this, not someone who tolerates AI tooling because it's in the JD, but someone who will drive it. That means:

Setting the expectation on the team that AI-assisted workflows are the default, not the exception
Identifying where agents can own meaningful slices of work: writing playbooks, generating tests, validating configurations, triaging alerts, drafting RCAs
Building the substrate the team needs: good documentation, clear schemas, MCP integrations, and the kind of structured knowledge that makes AI actually useful rather than a gimmick
Measuring what's working, killing what isn't, and being honest with the team and with leadership about both

If your reaction to this is "finally, a team that's serious about this," we should talk.

What success looks like

First 30 days: You've built 1:1 relationships with every member of the team, shadowed enough incidents and customer escalations to have a real picture of where we are, and identified the two or three operational patterns most worth changing.

First 90 days: You're the owner of our incident and post-mortem practice. You've onboarded two new senior hires (SRE and Security Engineer) and set clear expectations for their first deliverables. You've made at least one meaningful call on platform architecture or operational tooling that the team agrees was the right one.

First 12 months: The team is measurably more effective: fewer repeat incidents, faster time to resolution, more work automated away. You've hired at least one additional engineer, coached at least one existing team member into a stretch role, and established Platform Engineering as a function that other teams at Company want to work with.


241 views · 36 applications · 7d

Junior QA

Lumnix 🔥

$

Full Remote · Countries of Europe or Ukraine · 0.5 years of experience · English - B2 · SaaS


About Us:

We are a leading AI-powered platform designed to bring order and transparency to the creator ecosystem. Our mission is to empower creators and consumers alike by providing innovative solutions that enhance brand safety, brand suitability, and transparency in the creator economy. Join us on our journey to revolutionize the digital content landscape with cutting-edge technology.

Role Overview:

As our first QA Engineer, you will play a crucial role in establishing and implementing quality assurance practices. You will have the chance to set up testing processes, create test plans, and contribute to ensuring the quality of our software product. This role is perfect for someone eager to make a significant impact and develop their QA skills in a dynamic environment.

Key Responsibilities:

Perform manual functional, non-functional, regression, and exploratory testing on web and mobile applications.
Create, execute, and maintain comprehensive test cases and documentation.
Identify, report, and prioritize bugs and issues.
Manage and track defects throughout their lifecycle, ensuring effective communication with development teams.
Conduct release testing including smoke, regression, and post-live testing.
Collaborate with a cross-functional team to understand product requirements and ensure quality standards.
Use tools like JIRA to manage test cases, bugs, and overall test progress.
Analyze test results, identify trends, and contribute to troubleshooting efforts.
Develop and implement QA processes and testing strategies for our software products.

Requirements:

Minimum of 0.5 years of experience in software testing (internships, academic projects, or relevant experience).
Basic understanding of the software development lifecycle (SDLC) and QA methodologies.
Knowledge of JIRA for issue tracking.
Basic understanding of Agile methodologies.
Strong analytical and modeling skills.
Outstanding level of attention to detail.
At least Intermediate English proficiency (both spoken and written).
Excellent communication and interpersonal skills.
Self-organized, proactive, and motivated to learn and adapt.

Nice to Have:

Experience with integration testing and web-service testing.
Understanding of client-server architecture, HTTP basics, and data formats like JSON.
Familiarity with automation testing and basic programming knowledge.
Experience with tools like Insomnia/Postman.

What We Offer:

The unique opportunity to establish QA processes and practices within a growing company.
Hands-on experience with a variety of testing tasks and technologies.
A supportive environment for professional growth with access to training and learning resources.
Mentorship and guidance from experienced professionals to help you develop your skills.
The chance to work closely with developers and other team members to shape the quality of our products.
Flexible working arrangements and a collaborative team culture.

If you are an enthusiastic individual with a passion for quality and a desire to build QA practices from the ground up, while also growing your skills through learning opportunities, we would love to hear from you.


232 views · 40 applications · 13d

Incident / Reliability Lead

Lumnix 🔥

$$$$

Full Remote · Worldwide · 7 years of experience · English - C1 · Machine Learning / Big Data


About the client
The client is a neocloud building purpose-built GPU infrastructure for AI workloads. We operate large-scale clusters powering training and inference for some of the most demanding AI customers in the market, and we’re rapidly expanding. Our infrastructure runs on NVIDIA and AMD accelerators, InfiniBand and high-speed Ethernet fabrics, and a production stack spanning bare-metal provisioning, Kubernetes, and OpenStack.

We’re small, senior, and moving fast. The people who do well at the client own problems end-to-end and make decisions with incomplete information.

The role
We are hiring an Incident & Reliability Lead to own the incident and problem management programs at the client. This is the person who runs the call when a customer cluster goes down, drives the post-mortem afterward, and hardens the runbooks so the same failure does not surface twice.

This is not an ITIL paperwork role. You will be technically credible across Kubernetes, Linux, networking, and GPU infrastructure, enough to lead a live incident with platform engineers and a customer on the bridge. But your center of gravity is the process: fast, clean incident response, honest communication, and a relentless feedback loop between customers and engineering.

You will set the standard for how the client behaves during our most challenging hours. Done well, this is one of the most visible and highest-leverage roles on the team.

You will work closely with cloud operations, network teams, and customer-facing counterparts. Expect to be on-call in a shared rotation.

What you’ll own

• Incident command. Run major incidents end-to-end: triage, escalation, comms cadence, timeline, decision log, and drive-to-resolution. You are the conductor, not the fixer.

• Customer communications. Own customer-facing incident messaging: initial acknowledgement, status updates at defined intervals, resolution notice, and written RCA. Clear, honest, no hedging.

• Post-mortem and RCA program. Run blameless post-mortems, write the RCA, track action items to completion, and publish learnings across the org. Own the quality bar for RCA writing at the client.

• Escalation framework. Define and maintain the severity matrix, escalation trees, on-call rotations, and paging policies across platform, network, DC ops, and leadership.

• Runbooks and playbooks. Turn tribal knowledge into durable, testable runbooks. Drive the engineering teams to document before the next 2am page, not after.

• On-call health. Own the on-call experience: page volume, alert quality, rotation fairness, and handoff hygiene. Kill noisy alerts. Fix the ones that matter.

• Reliability metrics. Define and report the metrics that matter: MTTA, MTTR, incident count by class, SLO attainment, repeat-offender systems. Make reliability a number the business trusts.

• Game days and drills. Plan and run incident drills across clusters, regions, and customer scenarios. Surface gaps before customers do.

• Incident tooling. Own the incident tooling stack: paging, status page, incident channel automation, timeline capture, RCA templates. Evaluate and land the right tools.

What we’re looking for

Required

• 7+ years in SRE, production engineering, or infrastructure operations, with clear ownership of incident management in at least one prior role.

• Proven ability to run major incidents as the commander: keeping a room calm, driving decisions, and managing comms to customers and executives in parallel.

• Strong technical fluency across Linux, Kubernetes, networking (L2/L3, BGP basics), and cloud or bare-metal infrastructure. Deep enough in the stack to challenge assumptions and debug alongside engineers on a bridge call, and senior enough to turn each incident into durable improvements such as runbooks, automation, and guardrails, so the next incident does not need you.

• Excellent written communication. You can draft a customer-facing RCA that is accurate, clear, and does not over- or under-promise, and do it within the SLA window.

• Experience owning SLOs, error budgets, or equivalent reliability metrics as a working practice, not a slide.

• Experience designing on-call rotations and alerting practices across multiple teams.

• Comfortable operating in a fast-moving, still-forming environment. You write the runbook that did not exist yesterday.

Strongly preferred

• Experience in GPU infrastructure, HPC, or AI/ML platform operations (NCCL, InfiniBand, DCGM, or similar).

• Direct experience with Kubernetes incident response: control plane, CNI, operator patterns, noisy-neighbor debugging.

• Experience interfacing with neocloud, enterprise or hyperscaler customers during live incidents.

• Familiarity with Prometheus, Checkmk, Grafana, and modern incident tooling (OpsGenie/JIRA).

• Prior experience at a neocloud, CSP, or infrastructure vendor where downtime had direct revenue impact.

How we work

• We’re remote-first across US, LATAM, and EU time zones with a strong operational culture.

• We use AI tooling heavily in day-to-day operations and expect everyone on the team to be fluent with it and to help us get more leverage from it.

• We ship, we debug, we iterate. We don’t process-engineer our way around problems that need to be solved.

• We value written artifacts (RCAs, runbooks, design docs) over meetings. We value calm over performative urgency. We protect focus time and we protect sleep.

AI as a force multiplier

The client runs the infrastructure that powers the AI economy. We expect our team to use AI aggressively in how we work. For this role specifically, that means using AI to draft RCAs faster, triage logs and metrics during incidents, summarize timelines, generate customer-facing communications from raw incident data, and turn post-mortems into runbook updates automatically. If you have not yet built this into your workflow, you will here.

What success looks like

First 30 days

• Shadowed on-call rotations across the platform, network, and ops teams.

• Read every RCA written and identified the top three recurring failure classes.

• Met with the top three customers and understood their incident expectations and SLA terms.

• Documented the current-state incident response process, gaps and all.

First 90 days

• Published a revised severity matrix, escalation tree, and customer-comms cadence, adopted across teams.

• Run at least two major incidents as commander with clean timelines and customer comms.

• Delivered a new RCA template and coached the team through at least three post-mortems against it.

• Stood up baseline reliability metrics (MTTA, MTTR, incident count by class) with a weekly reporting cadence.

• Killed at least 20% of current alert volume by eliminating low-signal pages.

First 12 months

• The client has a mature, documented incident management program that a new customer can diligence with confidence.

• Repeat-offender incident classes are down materially, with clear attribution to runbook, tooling, or engineering changes you drove.

• Customer-facing RCAs are consistently delivered within SLA and are referenced by customers as a differentiator.

• On-call health (page volume, rotation fairness, engineer-reported burden) has measurably improved.

• Game days are a regular cadence and have surfaced and closed real gaps before they became customer incidents.


40 views · 1 application · 25d