DevOps Engineer – AWS / ML Infrastructure
We are seeking a skilled and proactive DevOps Engineer to join our team. This role is focused on developing and managing scalable infrastructure and deployment workflows in AWS to support data-driven and machine learning applications.
You will play a key role in building cloud-native systems with a strong emphasis on infrastructure as code, containerization, and CI/CD pipelines.
A solid understanding of AWS services and Python is essential, particularly for authoring infrastructure using AWS CDK. Experience with SageMaker and knowledge of ML systems are a strong advantage.
Qualifications:
- 3–5 years of experience in DevOps, cloud infrastructure, or SRE roles.
- Proficient in AWS services, especially CDK, Lambda, EC2, S3, SageMaker, and CloudWatch.
- Experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Strong experience with Python for scripting and infrastructure automation.
- Hands-on experience with containerization (Docker).
- Experience building and maintaining CI/CD pipelines.
Preferred Qualifications:
- AWS Certifications (e.g., DevOps Engineer, Solutions Architect, or Machine Learning Specialty).
- Background in software engineering or ML/AI infrastructure.
Key Responsibilities:
Infrastructure Development & Automation:
- Design, provision, and manage AWS infrastructure using AWS CDK and CloudFormation.
- Develop secure, scalable, and cost-efficient infrastructure to support machine learning and analytics workloads.
- Implement and manage cloud-native services such as EC2, ECS, Lambda, S3, RDS, SageMaker, and Bedrock.
- Ensure best practices for security, compliance, and disaster recovery are followed.
CI/CD & Deployment Automation:
- Design and maintain CI/CD pipelines for application and model deployment using tools like CodePipeline, CodeBuild, GitHub Actions, or similar.
- Automate testing, deployment, and rollback procedures to support continuous integration and delivery.
Containerization & Orchestration:
- Build and manage Docker containers for microservices and ML applications.
- Support deployment on ECS or Lambda with container-based runtimes.
- Implement image build, versioning, and artifact management workflows.
Machine Learning & Model Operations Support:
- Collaborate with ML engineers to deploy, monitor, and maintain models in SageMaker.
- Integrate infrastructure for pre-processing, inference, and retraining pipelines.
- Support model performance monitoring, logging, and metrics collection.
Monitoring, Observability & Logging:
- Set up monitoring and alerting using CloudWatch, Datadog, and other observability tools.
- Troubleshoot and resolve infrastructure, deployment, and performance issues proactively.
Collaboration & Documentation:
- Work closely with software, ML, and data teams to support DevOps best practices across the ML lifecycle.
- Maintain clear documentation for infrastructure, deployments, and operations processes.
- Participate in code reviews and architectural discussions.