Site Reliability Engineer (SRE)
Our client is a remote-first, dynamic international product company in the iGaming field. Currently we’re on the lookout for an experienced Site Reliability Engineer (SRE) for their team.
RESPONSIBILITIES:
- Development and implementation of monitoring, alerting and metrics processes;
- Participation in resolving failures and investigating their causes;
- Improving the level of application observability;
- Design, implementation and support of metrics for different levels of monitoring;
- Participation in the design and implementation of fault-tolerant application architecture;
- Organization of rapid response processes to incidents (Incident Response);
- Participating in Disaster Recovery strategies development;
- Introduction and popularization of post-incident meetings (Postmortem);
- Conducting root cause analysis (RCA) to prevent recurrence of incidents;
- Building automated problem-response systems;
- Participation in logging management for the product;
- Creation of reliability documentation and guides;
- Ownership of processes and approaches to support the uninterrupted operation of services.
REQUIREMENTS:
- At least 2 years of experience in an SRE position;
- Overall experience of 5+ years in infrastructure management;
- Linux/Unix environment;
- Strong expertise with containerised workloads and services such as Docker, Docker Swarm and Kubernetes;
- Experience with Cloud solutions (AWS, GCP, or any other);
- Automated deployment tools such as Ansible and Terraform;
- Proven expertise with databases such as MySQL, PostgreSQL, Redis and MongoDB (ClickHouse is a plus);
- Monitoring tools (Grafana/Prometheus/CloudWatch etc.);
- Logging tools (ELK stack etc.);
- Networking: network topologies and common network protocols and services (TCP/IP, DNS, HTTP(S), SSH, SMTP, IPMI, L2/L3 layers);
- Message brokers such as Kafka, RabbitMQ, etc.;
- Robust skills with CI/CD tools such as GitLab CI;
- Shell (Bash); any relevant programming language, e.g. Golang or Python, is a big plus;
- Storage/filesystems: NFS, RBD, Ceph, ext*, XFS, RAID*;
- Version control: experience administering version control systems such as Git.
WE OFFER:
- Possibility of remote work from anywhere in the world
- Generous days-off policy (vacation, sick leave, days off, holidays)
- Guaranteed performance reviews & career plan development
- Low bureaucracy level, with decisions made quickly
- Open-minded and easy-going management
- Friendly atmosphere among people who love their work.