Site Reliability Engineer

We're currently looking for an SRE to significantly improve the quality of our monitoring and observability.

Our core infrastructure is primarily based on bare-metal servers to deliver maximum performance and reliability. We run our applications in Kubernetes clusters to ensure scalability and streamlined deployment processes. Cloud solutions are integrated selectively to enhance flexibility and complement our primary infrastructure.

What you will do in this role:
- Manage, optimize, and maintain high availability of monitoring and observability tools.
- Automate alerts and incident response to reduce manual work.
- Enhance escalation processes for efficient issue resolution.
- Keep monitoring systems up to date and improve coverage.

About you: 
- Strong experience in monitoring and observability
- Proficiency in Prometheus and Grafana
- Programming skills (Python would be a plus)
- Networking skills
- Experience administering web applications

Would be a plus: 
- Experience with bare-metal infrastructure
- Experience with ELK or similar logging stacks
- Experience with Nagios or its derivatives
- Experience with tracing (Jaeger)
