Site Reliability Engineer

Responsibilities:

  • Maintain and improve existing monitoring configurations (alerts, dashboards, service discovery, scrape configs, etc.)
  • Implement and enhance alerting logic, including threshold tuning and dynamic alert conditions
  • Troubleshoot monitoring and metrics-related issues (e.g., missing data, false alerts, broken dashboards)
  • Support and improve self-developed metrics collectors and Python-based monitoring services
  • Assist NOC and SRE teams with alert deduplication, escalation rules, and alert quality improvements
  • Participate in the design and implementation of observability improvements for new services and infrastructure components
  • Review, modify, and extend existing scripts and plugins (primarily Python and Bash)
  • Provide monitoring-related guidance to development, infrastructure, and operations teams
  • Ensure monitoring tools and services operate reliably within Kubernetes clusters and Linux systems
  • Maintain monitoring configuration in Git and follow internal version control best practices
  • Participate in cross-team initiatives to improve the overall monitoring and incident response ecosystem

 

Requirements: 

  • Strong hands-on experience with Linux systems (primarily Ubuntu)
  • Practical knowledge of the Prometheus ecosystem, VictoriaMetrics, Grafana, and Zabbix
  • Experience supporting monitoring systems in Kubernetes-based infrastructure
  • Solid scripting skills (Bash)
  • Familiarity with Git and common version control workflows
  • Good understanding of networking and infrastructure concepts (ports, protocols, DNS, etc.)
  • Ability to troubleshoot metric collection, alert firing, and data visualization issues
  • Basic knowledge of SQL (e.g., for querying time-series or metadata stores)
  • Strong communication skills for cross-functional collaboration

 

Nice to have: 

  • Understanding of high-availability and failover patterns in observability systems
  • Experience working with SLO/SLA-based alerting or anomaly detection mechanisms
  • Exposure to automation and CI/CD pipelines for monitoring infrastructure

Required languages:

English B2 - Upper Intermediate
Linux, Kubernetes, Prometheus+Grafana
Published 1 September