Senior Site Reliability Engineer with GoLang to $6000 Offline

Hello. Today we are looking for:

Senior Site Reliability Engineer with GoLang

Summary
• Design, implement, and maintain highly available and resilient architectures for IP4G workloads on Google Cloud Platform, leveraging fault-tolerant designs and redundancy strategies.
• Monitor system performance, availability, and reliability metrics to proactively identify and address potential issues before they impact service uptime or performance.
• Implement disaster recovery solutions and failover mechanisms to ensure business continuity and minimize service disruptions.
• Optimize IP4G workloads for performance, scalability, and cost-efficiency in the Google Cloud environment, leveraging auto-scaling, load balancing, and caching strategies.
• Conduct capacity planning exercises and performance tuning activities to ensure optimal resource utilization and performance of IP4G systems and applications.
• Collaborate with adjacent teams (Product Management, Software Development, Engineering, Operations) to implement CI/CD pipelines and automation workflows for seamless deployment and scaling of IP4G workloads.
• Respond to and resolve critical incidents impacting the availability or performance of IP4G systems and applications on Google Cloud, following established incident response procedures and SLAs.
• Document incident response procedures, post-mortem reports, and lessons learned to improve incident management processes and enhance system reliability.
• Develop automation scripts and infrastructure as code (IaC) templates to automate routine tasks, streamline deployment processes, and improve operational efficiency.
• Continuously evaluate and adopt emerging technologies and best practices in automation and orchestration to enhance the reliability and scalability of the platform.
• Implement comprehensive monitoring and alerting solutions for IP4G workloads on Google Cloud, using monitoring tools such as, but not limited to, Prometheus and Grafana.
• Define and configure alerting thresholds, notifications, and escalation policies to ensure timely detection and response to anomalous behavior or performance degradation.
• Analyze monitoring data and performance metrics to identify trends, patterns, and areas for optimization, and proactively implement remediation measures.

The core requirements we must see are:
• Expertise and experience designing, building, operating and scaling Kubernetes Clusters and Environments
• Expertise and experience implementing and maintaining CI/CD pipelines and automation workflows for seamless deployment and scaling
• Experience responding to and resolving critical incidents impacting the availability or performance of systems and applications
• Experience developing automation scripts and infrastructure as code (IaC) templates to automate routine tasks, streamline deployment processes, and improve operational efficiency
• Experience implementing comprehensive monitoring and alerting solutions using tools such as Stackdriver, Prometheus, and Grafana (or similar)
• Excellent verbal and written communication skills
• Demonstrable systems and design thinking
• Excellent analytical and problem-solving skills, with the ability to troubleshoot complex technical issues in a dynamic, fast-paced environment

The job ad is no longer active
