Senior Platform/Site Reliability Engineer


Our client is a rapidly growing enterprise software organization that acquires and scales B2B SaaS products. They are building a shared cloud platform that serves as the engineering foundation for a growing portfolio of enterprise applications. This platform provides standardized infrastructure, deployment, observability, automation, and reliability capabilities across multiple products while enabling future growth without proportionally increasing operational complexity.

The organization is investing in modern platform engineering practices, cloud-native technologies, Infrastructure as Code, AI-assisted engineering, and operational automation to build a scalable, highly reliable engineering ecosystem.

They are looking for an experienced Senior Platform/Site Reliability Engineer to take ownership of the shared platform, establish engineering standards, and design the infrastructure that supports multiple enterprise SaaS products. This is a hands-on technical leadership role in which you will influence platform architecture, developer experience, operational reliability, and engineering best practices.

Working Hours: This role requires daily collaboration with a U.S.-based engineering team. Candidates must be available to work until at least 5:00 PM EST (U.S. Eastern Time), with flexibility to work beyond these hours when business needs require.

Responsibilities

Platform Engineering

Own the architecture and operation of the shared platform, including CI/CD, observability, deployment automation, secrets management, and developer tooling.
Define, implement, and enforce platform engineering standards across multiple products.
Build and maintain Infrastructure as Code using Terraform or OpenTofu, ensuring all infrastructure is version-controlled, reviewed, and provisioned through automation.
Develop self-service platform capabilities that enable engineering teams to deploy independently.

Event Streaming & Data Processing

Design and maintain event streaming infrastructure supporting real-time processing workloads.
Build and support batch processing infrastructure alongside live transactional systems.
Ensure reliability, scalability, performance, and cost efficiency of platform services.

CI/CD & Deployment

Design, build, and maintain CI/CD pipelines using GitHub Actions.
Automate recovery for common pipeline failures and improve deployment reliability.
Implement release management strategies, rollback mechanisms, and deployment patterns such as canary or blue-green deployments where appropriate.

Observability & Site Reliability

Own and maintain the observability platform using Grafana, Prometheus, Loki, CloudWatch, and related monitoring tools.
Define Service Level Objectives (SLOs), error budgets, and reliability metrics across multiple products.
Build intelligent alerting and monitoring solutions that provide actionable diagnostic information.
Design incident response processes, escalation procedures, and post-incident review practices.
Implement safe automated remediation for well-understood operational scenarios while ensuring human oversight for complex incidents.

Platform Expansion & Integration

Assess newly onboarded products for infrastructure maturity, Infrastructure as Code coverage, observability, and security.
Plan and execute platform integration and modernization initiatives while minimizing operational disruption.
Support the adoption of standardized platform capabilities across multiple engineering teams.

Engineering Automation

Leverage AI-assisted engineering tools and automation where appropriate to reduce operational overhead.
Automate infrastructure provisioning, CI/CD workflows, monitoring, secrets management, and operational tasks while maintaining engineering oversight for high-impact decisions.

Preferred Technology Stack

AWS
Terraform / OpenTofu
GitHub Actions
Grafana
Prometheus
Loki
AWS CloudWatch
AWS Secrets Manager or HashiCorp Vault
Amazon ECS and EKS
Event streaming technologies
Cost monitoring and cloud optimization tools

Requirements

8–12 years of experience in Platform Engineering, Site Reliability Engineering (SRE), DevOps, or Cloud Infrastructure Engineering.
Proven experience designing and operating production platform infrastructure across multiple environments or products.
Strong hands-on experience with Terraform (or OpenTofu) and Infrastructure as Code.
Extensive experience designing and maintaining CI/CD pipelines using GitHub Actions.
Experience operating event streaming infrastructure in production environments.
Strong AWS expertise, including ECS, EKS, IAM, VPC, RDS, CloudWatch, networking, and cloud infrastructure.
Hands-on experience with Grafana, Prometheus, Loki, and enterprise observability platforms.
Strong understanding of SRE principles, including SLOs, error budgets, incident response, and operational excellence.
Experience designing scalable, secure, highly available cloud infrastructure.
Strong troubleshooting, automation, and problem-solving skills.
Excellent communication skills with the ability to establish engineering standards across multiple teams.

Nice to Have

Experience building shared platform engineering capabilities supporting multiple products or business units.
Experience integrating newly acquired products or modernizing legacy platforms.
Experience designing developer self-service platforms.
Familiarity with AI-assisted engineering workflows and infrastructure automation.
Experience supporting high-volume enterprise SaaS products and distributed systems.
Strong focus on cloud cost optimization and operational efficiency.

What We Offer

Competitive market salary.
Fully remote work.
Opportunity to build and shape the engineering platform supporting a growing portfolio of enterprise SaaS products.
Work alongside experienced international engineering teams.
Exposure to modern cloud technologies, AI-assisted engineering, automation, and large-scale platform initiatives.
Professional growth through ownership of platform architecture, operational reliability, and engineering standards.
Daily collaboration with a U.S.-based engineering team, with availability required until at least 3:00 PM EST and flexibility to work longer when needed.

Required languages

English C2 - Proficient

Published 30 June

Minimum experience: 8 years
Work format: Fully remote
Candidate location: Worldwide

DevOps

Employment: Full-time
Domain: SaaS
Outstaff
