Senior Platform Reliability Engineer
Client
Our client is a division of a global business and financial news and information company. It is a leading market index provider and the owner and distributor of multiple financial services, including a dynamic information network of data, news, and analytics covering cash, derivatives markets, money markets, government and municipal bonds, currencies, commodities, mortgages, indices, insurance, and legal information.
Position overview
We’re seeking a Senior Platform Reliability Engineer to keep our Kubernetes-centric provisioning and Linux estate running smoothly. You’ll coordinate fixes when OS builds or upgrades hit exceptions, working across teams to find root causes from logs and metrics and to recommend changes.
You’ll automate repetitive work (Bash/Python), strengthen runbooks and observability, and document configurations and procedures. You’ll partner with hands-on engineers and architects in a highly technical, delivery-focused environment.
Responsibilities
- Operate and improve a Kubernetes-centric, open-source platform across provisioning and maintenance workflows.
- Coordinate resolution of exceptions in a multi-stage (≈10) provisioning pipeline; engage the right owners with clear, actionable context.
- Build and maintain automation and runbooks (Bash/Python) to reduce toil and increase reliability.
- Lead triage, log analysis, and root-cause investigation to minimize downtime.
- Enhance observability (metrics/logs/traces) and promote SLO-oriented practices.
- Operate and tune distributed data stores (e.g., Cassandra) and platform services.
- Evolve OS/network provisioning (PXE boot, Subiquity, Foreman, imaging) and server management (BMCs, multi-NIC).
- Partner with platform teams to improve automation, performance, security, and cost efficiency.
- Document system configurations, procedures, and changes for repeatability.
Requirements
- Strong Linux administration and troubleshooting (Ubuntu/Debian preferred).
- Production experience with Kubernetes (or similar orchestrator).
- Hands-on network/OS provisioning (PXE, Foreman, Subiquity, imaging) and server hardware management (BMCs, multiple NICs).
- Proficiency in scripting (Bash, Python) for automation and diagnostics.
- Ability to debug across the stack (infrastructure, workloads, automation, networks) and deliver RCA.
- Experience with distributed databases (Cassandra or similar).
- Familiarity with runbooks, incident management, and SRE/reliability practices.
- Clear communicator and process facilitator: knows whom to engage, what signals to collect, and how to drive issues to closure.
- CI/CD and IaC mindset (Git and pipelines; Terraform/Ansible a plus).
Nice to have
- Observability stacks (Prometheus, Grafana, ELK/EFK, OpenTelemetry).
- Workflow systems and retry logic (Argo Workflows, Jenkins).
- Python for internal tooling (Go a plus).
- Distributed systems fundamentals (consistency, replication, partition tolerance).
- Experience operating Cassandra at scale.
- Experience with Agile development methodologies.
- Experience working with foreign clients.
Required skills and experience
| Kubernetes | 5 years |
| Python | 5 years |
| Bash | 5 years |
| SRE | 4 years |
| CI/CD | 4 years |
Required languages
| English | B2 - Upper Intermediate |