Site Reliability Engineer
Technical Requirements:
- Platform & Infrastructure: GCP, Terraform (IaC), Kubernetes, Service Mesh (Istio / Linkerd), Linux distributed systems at scale
- GitOps & CI/CD: ArgoCD, GitHub Actions, GitOps principles, progressive delivery
- Language & Observability: Go (Golang), Prometheus / Grafana, distributed tracing, SQL / NoSQL at scale
- Competencies: Cloud-native mindset · Automate Everything · Speaks the developers' language · Fantastic communication · SLO / error budget ownership
- Nice to Have: Helm, eBPF, Vault / Crossplane, OpsGenie / PagerDuty
Responsibilities:
- Eliminate toil through automation, re-architecting, and refactoring, not just patching symptoms.
- Approach every incident with an "Automate Everything" mindset so the same problem never fires twice.
- Pair with software engineers to troubleshoot and resolve production incidents down to root cause.
- Drive complex infrastructure changes with full transparency, clear communication, and zero downtime.
- Design and implement self-healing, reliable, and scalable infrastructure in a cloud-native environment.
- Guide and unblock developers across multiple teams so they can keep shipping with confidence.
- Define SLOs for production services; own and manage the corresponding error budgets.
- Own the GitOps workflow via ArgoCD: every deployment is Git-defined, automated, and reproducible.
- Write or review postmortems after incidents; track corrective actions to completion.
- Participate in the follow-the-sun on-call rota and actively champion our DevOps culture.
What does an average day look like?
You'll proactively support production workloads, troubleshoot issues to their root cause, and write or review postmortems once incidents are resolved. You'll continuously identify weaknesses in infrastructure and observability and feed them into the improvement backlog.
Required domain experience
| Healthcare / MedTech | 1 year |
Required languages
| English | C1 - Advanced |
| Ukrainian | Native |
| Russian | Native |