Senior Platform/Site Reliability Engineer

Our client is a rapidly growing enterprise software organization that acquires and scales B2B SaaS products. They are building a shared cloud platform that serves as the engineering foundation for a growing portfolio of enterprise applications. This platform provides standardized infrastructure, deployment, observability, automation, and reliability capabilities across multiple products while enabling future growth without proportionally increasing operational complexity.

The organization is investing in modern platform engineering practices, cloud-native technologies, Infrastructure as Code, AI-assisted engineering, and operational automation to build a scalable, highly reliable engineering ecosystem.

They are looking for an experienced Senior Platform/Site Reliability Engineer to take ownership of the shared platform, establish engineering standards, and design the infrastructure that supports multiple enterprise SaaS products. This is a hands-on technical leadership role where you will influence platform architecture, developer experience, operational reliability, and engineering best practices.

Working Hours: This role requires daily collaboration with a U.S.-based engineering team. Candidates must be available to work until at least 3:00 PM EST (U.S. Eastern Time), with flexibility to work beyond these hours when business needs require.

Responsibilities

Platform Engineering

  • Own the architecture and operation of the shared platform, including CI/CD, observability, deployment automation, secrets management, and developer tooling.
  • Define, implement, and enforce platform engineering standards across multiple products.
  • Build and maintain Infrastructure as Code using Terraform or OpenTofu, ensuring all infrastructure is version-controlled, reviewed, and provisioned through automation.
  • Develop self-service platform capabilities that enable engineering teams to deploy independently.

Event Streaming & Data Processing

  • Design and maintain event streaming infrastructure supporting real-time processing workloads.
  • Build and support batch processing infrastructure alongside live transactional systems.
  • Ensure reliability, scalability, performance, and cost efficiency of platform services.

CI/CD & Deployment

  • Design, build, and maintain CI/CD pipelines using GitHub Actions.
  • Automate recovery for common pipeline failures and improve deployment reliability.
  • Implement release management strategies, rollback mechanisms, and deployment patterns such as canary or blue-green deployments where appropriate.

Observability & Site Reliability

  • Own and maintain the observability platform using Grafana, Prometheus, Loki, CloudWatch, and related monitoring tools.
  • Define Service Level Objectives (SLOs), error budgets, and reliability metrics across multiple products.
  • Build intelligent alerting and monitoring solutions that provide actionable diagnostic information.
  • Design incident response processes, escalation procedures, and post-incident review practices.
  • Implement safe automated remediation for well-understood operational scenarios while ensuring human oversight for complex incidents.

Platform Expansion & Integration

  • Assess newly onboarded products for infrastructure maturity, Infrastructure as Code coverage, observability, and security.
  • Plan and execute platform integration and modernization initiatives while minimizing operational disruption.
  • Support the adoption of standardized platform capabilities across multiple engineering teams.

Engineering Automation

  • Leverage AI-assisted engineering tools and automation where appropriate to reduce operational overhead.
  • Automate infrastructure provisioning, CI/CD workflows, monitoring, secrets management, and operational tasks while maintaining engineering oversight for high-impact decisions.

Preferred Technology Stack

  • AWS
  • Terraform / OpenTofu
  • GitHub Actions
  • Grafana
  • Prometheus
  • Loki
  • AWS CloudWatch
  • AWS Secrets Manager or HashiCorp Vault
  • Amazon ECS and EKS
  • Event streaming technologies
  • Cost monitoring and cloud optimization tools

Requirements

  • 8–12 years of experience in Platform Engineering, Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure Engineering.
  • Proven experience designing and operating production platform infrastructure across multiple environments or products.
  • Strong hands-on experience with Terraform (or OpenTofu) and Infrastructure as Code.
  • Extensive experience designing and maintaining CI/CD pipelines using GitHub Actions.
  • Experience operating event streaming infrastructure in production environments.
  • Strong AWS expertise, including ECS, EKS, IAM, VPC, RDS, CloudWatch, networking, and cloud infrastructure.
  • Hands-on experience with Grafana, Prometheus, Loki, and enterprise observability platforms.
  • Strong understanding of SRE principles, including SLOs, error budgets, incident response, and operational excellence.
  • Experience designing scalable, secure, highly available cloud infrastructure.
  • Strong troubleshooting, automation, and problem-solving skills.
  • Excellent communication skills with the ability to establish engineering standards across multiple teams.

Nice to Have

  • Experience building shared platform engineering capabilities supporting multiple products or business units.
  • Experience integrating newly acquired products or modernizing legacy platforms.
  • Experience designing developer self-service platforms.
  • Familiarity with AI-assisted engineering workflows and infrastructure automation.
  • Experience supporting high-volume enterprise SaaS products and distributed systems.
  • Strong focus on cloud cost optimization and operational efficiency.

What We Offer

  • Competitive market salary.
  • Fully remote work.
  • Opportunity to build and shape the engineering platform supporting a growing portfolio of enterprise SaaS products.
  • Work alongside experienced international engineering teams.
  • Exposure to modern cloud technologies, AI-assisted engineering, automation, and large-scale platform initiatives.
  • Professional growth through ownership of platform architecture, operational reliability, and engineering standards.
  • Daily collaboration with a U.S.-based engineering team, with availability required until at least 3:00 PM EST and flexibility to work longer when needed.

Required languages

English C2 - Proficient
Published 30 June