Senior Platform Reliability Engineer

Client

Our client is a division of a global business and financial news and information company. It is a leading market index provider and the owner and distributor of multiple financial services: a dynamic information network with data, news, and analytics including cash and derivatives markets, money markets, government and municipal bonds, currencies, commodities, mortgages, indices, insurance, and legal information.

 

Position overview

We’re seeking a Senior Platform Reliability Engineer to keep our Kubernetes-centric provisioning and Linux estate running smoothly. You’ll coordinate fixes when OS builds or upgrades hit exceptions, working across teams to find root causes from logs and metrics and to recommend changes.

You’ll automate repetitive work (Bash/Python), strengthen runbooks and observability, and document configurations and procedures. You’ll partner with hands-on engineers and architects in a highly technical, delivery-focused environment.

 

Responsibilities

  • Operate and improve a Kubernetes-centric, open-source platform across provisioning and maintenance workflows.
  • Coordinate resolution of exceptions in a multi-stage (≈10) provisioning pipeline; engage the right owners with clear, actionable context.
  • Build and maintain automation and runbooks (Bash/Python) to reduce toil and increase reliability.
  • Lead triage, log analysis, and root-cause investigation to minimize downtime.
  • Enhance observability (metrics/logs/traces) and promote SLO-oriented practices.
  • Operate and tune distributed data stores (e.g., Cassandra) and platform services.
  • Evolve OS/network provisioning (PXE boot, Subiquity, Foreman, imaging) and server management (BMCs, multi-NIC).
  • Partner with platform teams to improve automation, performance, security, and cost efficiency.
  • Document system configurations, procedures, and changes for repeatability.

 

Requirements

  • Strong Linux administration and troubleshooting (Ubuntu/Debian preferred).
  • Production experience with Kubernetes (or similar orchestrator).
  • Hands-on network/OS provisioning (PXE, Foreman, Subiquity, imaging) and server hardware management (BMCs, multiple NICs).
  • Proficiency in scripting (Bash, Python) for automation and diagnostics.
  • Ability to debug across the stack (infrastructure, workloads, automation, networks) and deliver RCA.
  • Experience with distributed databases (Cassandra or similar).
  • Familiarity with runbooks, incident management, and SRE/reliability practices.
  • Clear communicator and process facilitator: knows whom to engage, what signals to collect, and how to drive issues to closure.
  • CI/CD and IaC mindset (Git and pipelines; Terraform/Ansible a plus).

 

Nice to have

  • Observability stacks (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
  • Workflow systems and retry logic (Argo Workflows, Jenkins).
  • Python for internal tooling (Go a plus).
  • Distributed systems fundamentals (consistency, replication, partition tolerance).
  • Experience operating Cassandra at scale.
  • Experience with Agile development methodologies.
  • Experience working with foreign clients.

Required skills and experience

Kubernetes: 5 years
Python: 5 years
Bash: 5 years
SRE: 4 years
CI/CD: 4 years

Required languages

English B2 - Upper Intermediate
Published 12 December