Middle Site Reliability Engineer (SRE)
We are seeking a skilled and dedicated Site Reliability Engineer (SRE) to join our dynamic team. As an SRE, you will play a crucial role in maintaining and improving the reliability, performance, and scalability of our services. Your primary focus will be on developing production automations for incident response and semi-automated issue triage, maintaining infrastructure using Infrastructure as Code (IaC) tools like Terraform, and developing and monitoring our observability platforms using Prometheus and OpenTelemetry (OTel).
Key Responsibilities:
Automation Development:
- Design, implement, and maintain automated solutions for incident response and semi-automated issue triage.
- Design, implement, and maintain automated end-to-end (E2E) pipeline testing as part of the observability tooling.
- Develop scripts and tools to enhance operational efficiency and reduce manual intervention.
- Collaborate with stream-aligned and operations teams to identify automation opportunities.
Infrastructure as Code (IaC):
- Maintain and manage infrastructure using IaC tools, primarily Terraform.
- Ensure consistent and repeatable deployments of infrastructure and services.
- Conduct regular reviews and updates to IaC configurations to optimize performance and cost.
Observability and Monitoring:
- Develop, implement, and maintain observability platforms using Prometheus and OpenTelemetry (OTel).
- Set up and configure monitoring, alerting, and logging systems to ensure comprehensive visibility into system health and performance.
- Analyze metrics and logs to identify trends, potential issues, and areas for improvement.
- Design, implement, and maintain APM tools and dashboards.
- Participate in creating and monitoring R&D SLIs, SLOs, and SLAs.
Incident Management and Response:
- Respond to and resolve incidents promptly to minimize downtime and impact.
- Act as the T2 response team.
- Conduct post-incident reviews and implement improvements to prevent recurrence.
- Develop and maintain runbooks and documentation for incident response and automation processes.
- Create After Action Reports (AARs) to be sent to our customers after incidents.
Collaboration and Communication:
- Work closely with development teams to integrate reliability and performance considerations into the development lifecycle.
- Communicate effectively with stakeholders regarding system status, incidents, and improvements.
- Participate in on-call rotations and ensure proper handover and documentation.
Qualifications:
- Proven experience as a Site Reliability Engineer or in a similar role.
- Strong proficiency in Infrastructure as Code (IaC) tools, particularly Terraform.
- Experience with automation scripting using Python (required), Golang (preferred), Bash, or another scripting language.
- In-depth knowledge of observability tools such as Prometheus and OpenTelemetry (OTel).
- Understanding of system architecture, networking, and cloud infrastructure (e.g., AWS, Azure, GCP).
- Excellent problem-solving skills and the ability to work under pressure.
- Strong communication and collaboration skills.
- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Familiarity with CI/CD pipelines and related tools (e.g., Jenkins, GitLab CI).
- Knowledge of security best practices and compliance standards.
Preferred Qualifications:
- Experience with database management and optimization.
- Familiarity with microservices architecture and related technologies.
- Familiarity with SQL and NoSQL databases, such as PostgreSQL and MongoDB.
Required skills:
- AWS
- Azure
- Prometheus
- OpenTelemetry
- Terraform
- Kubernetes
- Docker
- Jenkins
- Python
Required languages:
- English: B2 (Upper Intermediate)
GitLab CI, Bash, GCP, Golang, scripting, CI/CD