Principal DevOps Engineer

Job Description

 

Principal DevOps Engineer (Product Support & Customer Operations)

Your Career

As a Principal DevOps Engineer within our On-Prem Platform team, you will play a critical role in ensuring the reliability, operability, and customer success of our on-premise platform powering Cortex products: Cortex XSOAR, Cortex XDR, Cortex XSIAM, and Cortex Cloud.

This role is centered around supporting production deployments, resolving complex customer issues, leading troubleshooting efforts, and driving operational excellence across customer environments.

You will work directly with enterprise customers, customer engineering teams, product teams, and platform engineering to ensure successful deployment and operation of large-scale Kubernetes-based environments.

This position requires a strong hands-on engineering mindset combined with excellent communication skills and the ability to lead technical discussions and customer calls under pressure.

You will own critical incidents end-to-end, identify systemic improvements, and continuously enhance platform stability, automation, and operational efficiency.

 

Your Impact

As a Principal DevOps Engineer, you will:

Product Operations & Customer Support

  • Serve as a senior technical escalation point for complex customer production issues.
  • Lead troubleshooting and resolution of platform incidents across customer environments.
  • Participate in customer-facing technical calls, war rooms, root cause analysis sessions, and operational reviews.
  • Support enterprise deployments and guide customers through installation, upgrades, scaling, and recovery scenarios.
  • Build trusted relationships with customers through technical expertise and operational excellence.

Incident Management & Troubleshooting

  • Drive incident response activities for critical production issues.
  • Perform deep system analysis across infrastructure, Kubernetes clusters, networking, storage, and applications.
  • Conduct root cause analysis (RCA) and implement preventive actions to improve reliability.
  • Develop diagnostic procedures, runbooks, and operational playbooks.
  • Collaborate across Engineering, Product, Support, and Customer Success teams to resolve issues efficiently.

Platform Reliability & Automation

  • Design and maintain highly available, scalable, and secure Kubernetes-based infrastructure.
  • Improve platform operability through automation and self-healing mechanisms.
  • Automate deployment, configuration, and lifecycle management using Infrastructure as Code (IaC).
  • Build operational tooling to reduce manual effort and improve observability.
  • Define monitoring, alerting, and capacity planning strategies.

DevOps & Platform Engineering

  • Design, implement, and optimize Kubernetes deployment architectures.
  • Develop automation frameworks and operational services using Go or Python.
  • Implement and maintain configuration management and provisioning workflows using Ansible.
  • Improve logging, monitoring, and observability across distributed environments.
  • Support storage, backup, disaster recovery, and performance optimization initiatives.

Qualifications

 

Your Experience

  • 5+ years of experience in DevOps, SRE, Platform Engineering, Infrastructure Engineering, or Production Operations roles.
  • Strong hands-on expertise with Linux administration and troubleshooting in production environments.
  • Deep operational experience with Kubernetes architecture, deployment, administration, and troubleshooting.
  • Strong experience with Ansible for automation and infrastructure management.
  • Strong programming and automation skills in Go or Python.
  • Proven experience supporting customer-facing production environments and handling critical incidents.
  • Excellent troubleshooting and debugging skills across infrastructure, networking, containers, and applications.
  • Experience conducting customer calls, technical workshops, and incident management sessions.
  • Strong understanding of:
    • Kubernetes (Helm, Operators, Cluster lifecycle)
    • Linux systems and performance tuning
    • Networking and distributed systems
    • Observability (logs, metrics, tracing)
    • Infrastructure as Code
    • HA and disaster recovery strategies
  • Experience with Kubernetes distributions and tooling (RKE2, Kubespray, Rancher, Helm).
  • Experience working with cloud and hybrid environments; experience with AWS, GCP, or Azure is a plus.
  • Strong understanding of DevOps, SRE, and Continuous Delivery practices.
  • Ability to work cross-functionally and influence engineering and product decisions.

Preferred

  • Experience supporting enterprise on-prem deployments.
  • Experience with storage technologies such as Ceph / Rook.
  • Experience operating large-scale distributed systems.
  • Background in cybersecurity or cloud platforms is a plus.

Required languages

English B2 - Upper Intermediate
Published 15 June