Senior DevOps Engineer

CodeSmart is looking for a hands-on, systems-minded Senior DevOps Engineer to own and evolve our cloud infrastructure, deployment pipelines, and runtime reliability. You’ll work across AWS, Terraform, Rancher/Kubernetes, and Linux to build highly available, production-grade systems that can scale with our products.

This role is ideal for someone who enjoys designing resilient architectures, automating everything, and partnering closely with backend/frontend engineers (Python, Java, and JavaScript stacks) to ship safely and operate confidently.
 

Responsibilities

  • Design and operate AWS infrastructure for scalable, secure, and highly available production environments.
  • Build and maintain Infrastructure-as-Code using Terraform (modules, state management, workspaces, best practices).
  • Manage container orchestration and workloads using Rancher (Kubernetes lifecycle, clusters, upgrades, policies).
  • Own CI/CD pipelines (build, test, deploy), release workflows, and environment promotion strategies.
  • Implement reliability practices: monitoring, logging, alerting, incident response, postmortems, SLOs/SLIs.
  • Design high availability and disaster recovery approaches (multi-AZ, backups, failover, runbooks).
  • Manage Linux systems and networking: hardening, performance tuning, troubleshooting, OS-level automation.
  • Support and optimize VM-based workloads where needed (images, scaling, patching, sizing, cost control).
  • Partner with engineering teams to improve operability for Python/Java/JS services (deploy patterns, configs, secrets, rollback).
  • Establish secure secrets management, IAM best practices, and least-privilege access across environments.
  • Drive continuous improvement in infrastructure security, stability, and developer experience.
     

Requirements

  • 6+ years of commercial DevOps / SRE / Platform Engineering experience (or equivalent).
  • Strong AWS experience across core services (e.g., VPC, EC2, IAM, ALB/NLB, RDS, S3, CloudWatch) and production operations.
  • Strong Terraform experience: reusable modules, remote state, state locking, drift detection, and multi-environment setups.
  • Hands-on Rancher experience managing Kubernetes clusters in production (upgrades, node pools, ingress, RBAC, policies).
  • Expert Linux skills: debugging, networking basics, system performance, permissions, automation, and security hardening.
  • Proven experience designing and operating high availability systems (multi-AZ, failover, redundancy, DR planning).
  • Solid understanding of virtual machines: compute sizing, images, networking, storage, patching, and lifecycle management.
  • Strong scripting and automation skills in Python for DevOps work (automation tooling, CLI scripts, operational utilities).
  • Working knowledge of Java stack operations (JVM service deployment patterns, tuning basics, observability, rollout/rollback).
  • Working knowledge of JS stack operations (Node.js services, build/release, environment config, runtime monitoring).
  • Experience with containers and Kubernetes fundamentals: deployments, services, ingress, configmaps/secrets, autoscaling.
  • Familiarity with secure SDLC practices: secrets, IAM policies, vulnerability scanning, least privilege, audit readiness.
  • Strong troubleshooting mindset with demonstrated production incident ownership.
  • Excellent communication skills and ability to collaborate in a distributed team.
     

Nice to Have

  • Experience with GitOps (e.g., Argo CD / Flux), Helm, and progressive delivery (blue/green, canary).
  • Experience with centralized logging/metrics stacks (Prometheus, Grafana, ELK/OpenSearch).
  • Experience optimizing cloud costs (rightsizing, reserved instances/savings plans, storage lifecycle policies).
  • Experience with service mesh / advanced Kubernetes networking (as relevant to the environment).
  • Prior experience supporting AI/LLM-heavy workloads (bursty traffic, queueing, cost monitoring, reliability patterns).
     

What to Include (Helps Us Review Faster)

  • A short write-up of a production system you operated (AWS + Terraform + Rancher), including HA/DR approach.
  • Examples of incident ownership (what happened, how you mitigated it, what you improved afterward).
  • Links to IaC examples (Terraform modules), pipeline examples, or public repos (if available).

Required languages

English C1 - Advanced
CI/CD, Prometheus+Grafana, DevOps, AWS, Git
Published 19 January