Middle Site Reliability Engineer (SRE)

We are seeking a skilled and dedicated Site Reliability Engineer (SRE) to join our dynamic team. As an SRE, you will play a crucial role in maintaining and improving the reliability, performance, and scalability of our services. Your primary focus will be on developing production automations for incident response and semi-automated issue triage, maintaining infrastructure using Infrastructure as Code (IaC) tools like Terraform, and developing and monitoring our observability platforms using Prometheus and OpenTelemetry (OTel).

Key Responsibilities:


Automation Development:

  • Design, implement, and maintain automated solutions for incident response and semi-automated issue triage.
  • Design, implement, and maintain automated end-to-end (E2E) pipeline testing as part of our observability tooling.
  • Develop scripts and tools to enhance operational efficiency and reduce manual intervention.
  • Collaborate with stream-aligned and operations teams to identify automation opportunities.

 

Infrastructure as Code (IaC):

  • Maintain and manage infrastructure using IaC tools, primarily Terraform.
  • Ensure consistent and repeatable deployments of infrastructure and services.
  • Conduct regular reviews and updates to IaC configurations to optimize performance and cost.

 

Observability and Monitoring:

  • Develop, implement, and maintain observability platforms using Prometheus and OpenTelemetry (OTel).
  • Set up and configure monitoring, alerting, and logging systems to ensure comprehensive visibility into system health and performance.
  • Analyze metrics and logs to identify trends, potential issues, and areas for improvement.
  • Design, implement, and maintain APM tools and dashboards.
  • Participate in creating and monitoring R&D SLIs, SLOs, and SLAs.

 

Incident Management and Response:

  • Respond to and resolve incidents promptly to minimize downtime and impact.
  • Act as part of the T2 response team.
  • Conduct post-incident reviews and implement improvements to prevent recurrence.
  • Develop and maintain runbooks and documentation for incident response and automation processes.
  • Create After Action Reports (AARs) to be sent to our customers after incidents.

 

Collaboration and Communication:

  • Work closely with development teams to integrate reliability and performance considerations into the development lifecycle.
  • Communicate effectively with stakeholders regarding system status, incidents, and improvements.
  • Participate in on-call rotations and ensure proper handover and documentation.

 

Qualifications:

  • Proven experience as a Site Reliability Engineer or in a similar role.
  • Strong proficiency in Infrastructure as Code (IaC) tools, particularly Terraform.
  • Experience with automation scripting using Python (required), Golang (preferred), Bash, or another scripting language.
  • In-depth knowledge of observability tools such as Prometheus and OpenTelemetry (OTel).
  • Understanding of system architecture, networking, and cloud infrastructure (e.g., AWS, Azure, GCP).
  • Excellent problem-solving skills and the ability to work under pressure.
  • Strong communication and collaboration skills.
  • Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
  • Familiarity with CI/CD pipelines and related tools (e.g., Jenkins, GitLab CI).
  • Knowledge of security best practices and compliance standards.

 

Preferred Qualifications:

  • Experience with database management and optimization.
  • Familiarity with microservices architecture and related technologies.
  • Familiarity with SQL and NoSQL databases, such as PostgreSQL and MongoDB.

 

Required skills experience

AWS
Azure
Prometheus
OpenTelemetry
Terraform
Kubernetes
Docker
Jenkins
Python

Required languages

English B2 - Upper Intermediate
GitLab CI, Bash, GCP, Golang, scripting, CI/CD
Published 27 November