Site Reliability Engineer (SRE) - Gambling Product Team $5000-6000 (offline)
Employment Type: Full-Time, 100% Remote
Language Requirement: Upper-Intermediate English
Job Overview:
We are seeking a skilled Site Reliability Engineer (SRE) to join our Gambling Product Team. In this role, you will be instrumental in ensuring the optimal performance of our cloud applications and AWS cloud infrastructure. You'll collaborate with product engineering, cloud architecture, DevOps, and DevSecOps teams to enhance our system reliability and efficiency.
Key Responsibilities:
• Monitor service-level indicators (SLIs) and enforce service-level agreements (SLAs).
• Handle and respond to service outages and interruptions. This includes troubleshooting, root cause analysis, and post-mortem reviews to prevent future incidents.
• Monitor infrastructure and application performance to predict future system demands. This includes provisioning additional resources or optimizing the existing setup to handle the load.
• Complete complex development, design, implementation, architecture specification, and maintenance activities as needed.
• Automate manual operations work, including the deployment of code and configuration changes.
• Set up and maintain monitoring, logging, and alerting systems.
• Monitor and analyze infrastructure costs to suggest ways to optimize and reduce unnecessary expenses.
• Identify and remove bottlenecks in the system to improve performance. This might involve code optimizations, database tuning, or optimizing server configurations.
• Build software and systems to manage platform infrastructure and applications
• Provide operational support and engineering for multiple large distributed software applications
• Fix issues escalated by the development team
• Document best practices, runbooks, and procedures for troubleshooting common issues.
• Improve reliability, quality, and time-to-market of our suite of software solutions
• Measure and optimize system performance, pushing our capabilities forward, getting ahead of product team needs, and innovating to continually improve
• Work collaboratively with product & software engineering professionals to define infrastructure and deployment requirements.
• Provision, configure and maintain cloud infrastructure defined as code.
• Ensure that the infrastructure and applications meet security standards and comply with relevant regulations. This might involve regular security audits, patching, and vulnerability assessments.
• Troubleshoot problems across a wide array of services and functional areas.
• Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
• Partner with development teams to improve services through rigorous testing and release procedures
• Create sustainable systems and services through automation and uplifts
• Demonstrate strong organizational skills, customer service focus, attention to detail, and process orientation
• Maintain consistent and regular attendance, including on-call availability on a rotational basis, which is an essential function of this job
Qualifications:
• Bachelor’s degree or equivalent in a relevant discipline.
• 5 years of experience building and maintaining AWS infrastructure (VPC, EC2, Security Groups, IAM, ECS, CodeDeploy, CloudFront, S3)
• Strong understanding of how to secure AWS environments and meet compliance requirements
• Hands-on experience deploying and managing infrastructure with Terraform
• Experience with Kubernetes, GitHub, Jenkins, and ELK, and with deploying applications on AWS
• Ability to learn/use a wide variety of open-source technologies and tools
• Ability to program (structured and OO) in one or more high-level languages, with a strong preference for Go
• Experience with distributed storage technologies such as NFS, HDFS, Ceph, and S3, as well as dynamic resource management frameworks (Mesos, Kubernetes, YARN)
• Must be able to work varied shifts, including nights, weekends and holidays.