Senior DevOps Engineer
CodeSmart is looking for a hands-on, systems-minded Senior DevOps Engineer to own and evolve our cloud infrastructure, deployment pipelines, and runtime reliability. You’ll work across AWS, Terraform, Rancher/Kubernetes, and Linux to build highly available, production-grade systems that can scale with our products.
This role is ideal for someone who enjoys designing resilient architectures, automating everything, and partnering closely with backend/frontend engineers (Python, Java, and JavaScript stacks) to ship safely and operate confidently.
Responsibilities
- Design and operate AWS infrastructure for scalable, secure, and highly available production environments.
- Build and maintain Infrastructure-as-Code using Terraform (modules, state management, workspaces, best practices).
- Manage container orchestration and workloads using Rancher (Kubernetes lifecycle, clusters, upgrades, policies).
- Own CI/CD pipelines (build, test, deploy), release workflows, and environment promotion strategies.
- Implement reliability practices: monitoring, logging, alerting, incident response, postmortems, SLOs/SLIs.
- Design high availability and disaster recovery approaches (multi-AZ, backups, failover, runbooks).
- Manage Linux systems and networking: hardening, performance tuning, troubleshooting, OS-level automation.
- Support and optimize VM-based workloads where needed (images, scaling, patching, sizing, cost control).
- Partner with engineering teams to improve operability for Python/Java/JS services (deploy patterns, configs, secrets, rollback).
- Establish secure secrets management, IAM best practices, and least-privilege access across environments.
- Drive continuous improvement in infrastructure security, stability, and developer experience.
Requirements
- 6+ years of commercial DevOps / SRE / Platform Engineering experience (or equivalent).
- Strong AWS experience across core services (e.g., VPC, EC2, IAM, ALB/NLB, RDS, S3, CloudWatch) and production operations.
- Strong Terraform experience: reusable modules, remote state, state locking, drift detection, and multi-environment setups.
- Hands-on Rancher experience managing Kubernetes clusters in production (upgrades, node pools, ingress, RBAC, policies).
- Expert Linux skills: debugging, networking basics, system performance, permissions, automation, and security hardening.
- Proven experience designing and operating high availability systems (multi-AZ, failover, redundancy, DR planning).
- Solid understanding of virtual machines: compute sizing, images, networking, storage, patching, and lifecycle management.
- Strong scripting and automation skills in Python for DevOps work (automation tooling, CLI scripts, operational utilities).
- Working knowledge of Java stack operations (JVM service deployment patterns, tuning basics, observability, rollout/rollback).
- Working knowledge of JS stack operations (Node.js services, build/release, environment config, runtime monitoring).
- Experience with containers and Kubernetes fundamentals: deployments, services, ingress, configmaps/secrets, autoscaling.
- Familiarity with secure SDLC practices: secrets, IAM policies, vulnerability scanning, least privilege, audit readiness.
- Strong troubleshooting mindset with demonstrated production incident ownership.
- Excellent communication skills and ability to collaborate in a distributed team.
Nice to Have
- Experience with GitOps (e.g., Argo CD / Flux), Helm, and progressive delivery (blue/green, canary).
- Experience with centralized logging/metrics stacks (Prometheus, Grafana, ELK/OpenSearch).
- Experience optimizing cloud costs (rightsizing, reserved instances/savings plans, storage lifecycle policies).
- Experience with service mesh / advanced Kubernetes networking (as relevant to the environment).
- Prior experience supporting AI/LLM-heavy workloads (bursty traffic, queueing, cost monitoring, reliability patterns).
What to Include (Helps Us Review Faster)
- A short write-up of a production system you operated (AWS + Terraform + Rancher), including HA/DR approach.
- Examples of incident ownership (what happened, how you mitigated, what you improved afterward).
- Links to IaC examples (Terraform modules), pipeline examples, or public repos (if available).
Required languages
- English: C1 (Advanced)
CI/CD, Prometheus+Grafana, DevOps, AWS, Git
Published 19 January