Site Reliability Engineer Offline
We're currently looking for an SRE to significantly improve the quality of our monitoring and observability.
Our core infrastructure is primarily based on bare-metal servers to deliver maximum performance and reliability. We run our applications in Kubernetes clusters to ensure scalability and streamlined deployment processes. Cloud solutions are integrated selectively to enhance flexibility and complement our primary infrastructure.
What you will do in this role:
- Manage, optimize, and maintain high availability of monitoring and observability tools.
- Automate alerts and incident response to reduce manual work.
- Enhance escalation processes for efficient issue resolution.
- Keep monitoring systems up to date and improve coverage.
About you:
- Strong experience in monitoring and observability
- Proficiency in Prometheus and Grafana
- Programming skills (Python would be a plus)
- Networking skills
- Experience with administering web applications
Nice to have:
- Experience with bare-metal infrastructure
- Experience with ELK or similar logging stacks
- Experience with Nagios or its derivatives
- Experience with distributed tracing (e.g., Jaeger)