Cornerstone is looking for a Cloud Site Reliability Engineer (Cloud SRE) to join our Global SRE Team, where you will help build a meaningful engineering discipline that combines software and systems engineering to develop creative solutions to operations problems.
We are one of the largest SaaS companies on the planet, with more than 10 million business users on our application. This critical position offers great visibility and individual growth potential for an ambitious IT professional, but it also requires proven experience supporting exponential growth at a global scale. You will join a team of curious problem solvers with diverse perspectives who think big and take risks.
In this environment, you will lead relevant projects, backed by an organization that provides the support and mentorship you need to learn and grow. The Cloud SRE will work closely with other teams in the Techops and Software Development organizations to build and maintain cloud datacenters and services.
In this role you will work on the following:
Own the overall health and performance of our AWS infrastructure that serves our customer-facing applications and services. (This includes all AWS accounts created through our Region-as-Code initiative, but not the domain team services hosted within them. Core components include Route 53, Transit Gateway, and IAM.)
Develop, test, and debug automated tasks (apps, systems, infrastructure) and manage orchestration tools such as Jenkins, AWS Lambda, and AWS Systems Manager.
Troubleshoot priority incidents and facilitate blameless post-mortems
Analyze past incidents and usage patterns to better predict issues and act proactively
Work with development teams throughout the software life cycle, ensuring sustainable software releases
Build self-healing and resiliency patterns and drive their adoption
Lead and participate in performance tests; identify bottlenecks, opportunities for optimization, and capacity demands
Design and build reliable, fault-tolerant cloud infrastructure following industry best practices.
Perform hands-on management of cloud infrastructure, and document that infrastructure and its policies
Understand on-premises policies, solutions, and technologies, and integrate them with the cloud infrastructure where applicable.
Understand cloud security best practices and work with Security teams to design and implement security infrastructure
Serve as a technical point of contact for SRE and help communicate SRE principles to project teams
One example project is refactoring the per-account AWS SNS topic subscriptions used for alerting into a centralized model in a shared account.
Another example is developing a way to safely handle configuration drift in foundational resources, such as the VPC and subnet configuration of an AWS account, using our internally developed pipeline tooling.
All infrastructure is deployed through Jenkins workflows written in Groovy that use AWS-native technologies such as CloudFormation and boto3 to create the AWS accounts and foundational services. One challenge is handling configuration drift in older accounts provisioned by an automation framework that has evolved rapidly since the project began. Another is continually reducing the time it takes to provision a new AWS account, apply all foundational components, and hand it off to the consumer.
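To illustrate the drift problem described above, here is a minimal sketch that uses CloudFormation's built-in drift detection through boto3 to list foundational resources (for example, a VPC or subnets) whose live configuration no longer matches their template. This is not the team's actual tooling. It assumes the foundational components are deployed as CloudFormation stacks, and the function name, stack name, polling interval, and timeout are hypothetical. The client is passed in, so you can supply `boto3.client("cloudformation")` in a pipeline or a stub in a test.

```python
import time


def find_drifted_resources(cfn_client, stack_name, poll_seconds=5.0, timeout_seconds=300.0):
    """Detect drift on one CloudFormation stack and return the drifted resources.

    Assumption (hypothetical setup): the foundational components (VPC, subnets, ...)
    are managed by the stack named `stack_name`. `cfn_client` is a boto3
    CloudFormation client or any object exposing the same methods.
    Returns a list of (logical_id, resource_type, drift_status) tuples.
    """
    # Start an asynchronous drift detection run for the whole stack.
    detection_id = cfn_client.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]

    # Poll until CloudFormation reports that detection has finished.
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = cfn_client.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        state = status["DetectionStatus"]
        if state == "DETECTION_COMPLETE":
            break
        if state == "DETECTION_FAILED":
            reason = status.get("DetectionStatusReason", "unknown reason")
            raise RuntimeError(f"Drift detection failed for {stack_name}: {reason}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Drift detection for {stack_name} did not finish in time")
        time.sleep(poll_seconds)

    # Collect only resources changed or deleted outside CloudFormation,
    # following NextToken until every page has been read.
    drifted = []
    request = {
        "StackName": stack_name,
        "StackResourceDriftStatusFilters": ["MODIFIED", "DELETED"],
    }
    while True:
        page = cfn_client.describe_stack_resource_drifts(**request)
        for drift in page["StackResourceDrifts"]:
            drifted.append(
                (drift["LogicalResourceId"], drift["ResourceType"], drift["StackResourceDriftStatus"])
            )
        token = page.get("NextToken")
        if not token:
            return drifted
        request["NextToken"] = token
```

In a pipeline step, this might be called as `find_drifted_resources(boto3.client("cloudformation"), "foundation-network")`, where `"foundation-network"` is a hypothetical stack name. Detection only reports drift. Deciding how to remediate each drifted resource safely remains a separate step.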
We expect you to have:
3 years of solid AWS experience; an AWS Solutions Architect – Associate or Professional certification would be a good addition
Mastery of at least two programming languages (e.g., Python, Groovy, PowerShell) across design, coding, testing, and software delivery
Adept at developing automation tools (e.g., Jenkins, Ansible), systems, and services across multiple technology domains
Advanced knowledge of infrastructure components (e.g., networking, DNS, cloud services, orchestration tools, containerization, compute, and storage systems)
Proficiency in making service-level changes to systems and troubleshooting their components
Experience with Splunk or other log aggregation/monitoring tools
Experience engineering solutions for metrics gathering/publishing and event collection/correlation across distributed architectures, automation, monitoring, intelligent alerting, random fault injection (Chaos Engineering), and self-healing
Experience in a highly regulated, standards-compliant, production environment (SOC, ISO, etc.)
5+ years of overall IT experience and 3 years of scripting/automation experience
Excellent interpersonal skills to interact with senior-level personnel and team members
Ability to work well both independently and in teams
Ability to multitask and to prioritize rapidly changing task assignments
Experience working in a fast-paced and deadline-oriented environment
Excellent organization and communication skills, both written and verbal
About Freelancer Iryna Sevastianova
Hello, my name is Iryna.
I am a freelance recruiter working with product companies; my clients also include cozy service companies.
I work in a personal, human way and approach every candidate with care.
All current vacancies are on this page!
Job posted on 7 July 2021