Site Reliability Engineer
The Opportunity
We are seeking a Site Reliability Engineer to ensure the reliability, scalability, and performance of critical infrastructure systems and to drive best practices in operational excellence. You will take a leadership role in implementing automation, optimizing system monitoring, troubleshooting complex issues, and enhancing the security and reliability of our IT systems. The ideal candidate is a technical leader with deep expertise in managing large-scale infrastructure and a strong drive to improve operational efficiency.
Office hours and location
- Full-time, Monday to Friday
- Location: Argyle, TX – Onsite
Duties and Responsibilities
- Take ownership of the design, implementation, and maintenance of the organization’s infrastructure and systems, ensuring their optimal performance and reliability. This includes managing self-hosted systems, performing routine server maintenance, and conducting system upgrades.
- Lead and manage the resolution of complex incidents, ensuring minimal downtime and disruption. Drive the incident response process, including root cause analysis, post-mortem reporting, and implementing preventative measures to improve system reliability.
- Implement and optimize monitoring and alerting across the infrastructure. Proactively monitor systems and services to identify and resolve potential issues before they affect end users, maintaining high uptime and availability.
- Drive automation initiatives to streamline operational tasks, including server provisioning, configuration management, and incident response processes. Lead the effort to build and enhance internal automation tools and scripts to improve operational efficiency.
- Collaborate with the security team to improve security posture by enforcing security best practices, ensuring secure access management, and automating security patching and vulnerability management. Contribute to compliance initiatives by ensuring systems meet industry standards and regulations.
- Work closely with development teams, DevOps, and other cross-functional teams to implement and maintain reliable systems and ensure the smooth deployment of applications. Participate in the design and development of highly reliable, scalable, and fault-tolerant systems.
- Oversee and support the provisioning, de-provisioning, and management of user accounts and SaaS platforms. Work closely with IT and HR teams to ensure seamless access management during onboarding and offboarding.
- Design, implement, and test disaster recovery and backup processes to ensure data integrity and availability. Continuously improve these processes to minimize recovery time objectives (RTO) and recovery point objectives (RPO).
- Create and maintain comprehensive documentation for systems, processes, and incident response procedures. Share best practices and lessons learned through knowledge-sharing sessions, ensuring high levels of collaboration across teams.
- Stay current with the latest technologies and best practices in cloud computing, containerization, automation, and site reliability engineering. Lead initiatives to evaluate, implement, and integrate new technologies that can improve system performance and reduce operational costs.
Required Experience and Job Qualifications
- 3+ years of experience in Site Reliability Engineering, DevOps, or IT operations, including significant hands-on experience managing large-scale systems.
- Proficient with Linux/Unix systems, particularly RHEL- and Ubuntu-like distributions.
- Experience with cloud infrastructure, primarily AWS, and the ability to manage and troubleshoot cloud-based environments.
- Strong understanding of networking concepts, such as TCP/IP, DNS, and HTTP/HTTPS, and hands-on experience troubleshooting network connectivity issues.
- Hands-on experience with tools like Apache, Nginx, MySQL, PostgreSQL, Docker, Kubernetes, Zabbix, Ansible, Puppet, and Terraform.
- Experience with infrastructure automation tools (e.g., Ansible, Puppet, Chef, Terraform) and configuration management best practices.
- Proficient in scripting and automation (e.g., Python, Bash, or Ruby).
- Experience with monitoring tools (e.g., Prometheus, Grafana, Datadog) and incident management platforms.
- Strong experience in troubleshooting complex incidents and system failures.
- Good understanding of security best practices, including access management, vulnerability scanning, and system hardening.
- Experience with disaster recovery, backup, and high availability solutions.
About us
Aircraft Performance Group (APG)
Aircraft Performance Group, LLC (APG) is a flight operations performance engineering firm, established in 1999, that specializes in Runway Analysis, Weight and Balance, and Flight Planning solutions for the airline and corporate flight operations industry. We maintain a current worldwide database of airport information and provide data based on FAR, EASA, and CASA requirements. APG is headquartered in Lone Tree, Colorado. Learn more at flyapg.com.
Required languages
| Language | Level |
| --- | --- |
| English | C1 - Advanced |
| Ukrainian | Native |