The company develops commercial products, both in-house and as a consultant/solutions vendor for other companies.
Active, friendly, small teams. Opportunities for formal and informal training.
Senior Site Reliability Engineer (SRE/DevOps)
Countries of Europe or Ukraine · 5 years of experience · English - None
We are seeking a skilled and dedicated Site Reliability Engineer (SRE) to join our dynamic team. As an SRE, you will play a crucial role in maintaining and improving the reliability, performance, and scalability of our services. Your primary focus will be on developing production automations for incident response and semi-automated issue triage, maintaining infrastructure using Infrastructure as Code (IaC) tools like Terraform, and developing and monitoring our observability platforms using Prometheus and OpenTelemetry (OTel).
Key Responsibilities:
Automation Development:
- Design, implement, and maintain automated solutions for incident response and semi-automated issue triage.
- Design, implement, and maintain automated E2E pipeline testing as part of observability tools.
- Develop scripts and tools to enhance operational efficiency and reduce manual intervention.
- Collaborate with stream-aligned and operations teams to identify automation opportunities.
Infrastructure as Code (IaC):
- Maintain and manage infrastructure using IaC tools, primarily Terraform.
- Ensure consistent and repeatable deployments of infrastructure and services.
- Conduct regular reviews and updates to IaC configurations to optimize performance and cost.
Observability and Monitoring:
- Develop, implement, and maintain observability platforms using Prometheus and OpenTelemetry (OTel).
- Set up and configure monitoring, alerting, and logging systems to ensure comprehensive visibility into system health and performance.
- Analyze metrics and logs to identify trends, potential issues, and areas for improvement.
- Design, implement, and maintain APM tools and dashboards.
- Participate in R&D creation and monitoring of SLIs, SLOs, and SLAs.
Incident Management and Response:
- Respond to and resolve incidents promptly to minimize downtime and impact.
- Act as the Tier 2 (T2) response team.
- Conduct post-incident reviews and implement improvements to prevent recurrence.
- Develop and maintain runbooks and documentation for incident response and automation processes.
- Create After Action Reports (AAR) to be sent to our customers after incidents.
Collaboration and Communication:
- Work closely with development teams to integrate reliability and performance considerations into the development lifecycle.
- Communicate effectively with stakeholders regarding system status, incidents, and improvements.
- Participate in on-call rotations and ensure proper handover and documentation.
Qualifications:
- Proven experience as a Site Reliability Engineer or in a similar role.
- Strong proficiency in Infrastructure as Code (IaC) tools, particularly Terraform.
- Experience with automation scripting using Python (required), Golang (preferred), Bash, or another scripting language.
- In-depth knowledge of observability tools such as Prometheus and OpenTelemetry (OTel).
- Understanding of system architecture, networking, and cloud infrastructure (e.g., AWS, Azure, GCP).
- Excellent problem-solving skills and the ability to work under pressure.
- Strong communication and collaboration skills.
- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes).
- Familiarity with CI/CD pipelines and related tools (e.g., Jenkins, GitLab CI).
- Knowledge of security best practices and compliance standards.
Preferred Qualifications:
- Experience with database management and optimization.
- Familiarity with microservices architecture and related technologies.
- Familiarity with SQL and NoSQL databases, such as PostgreSQL and MongoDB.