DevOps / MLOps Engineer Offline

Requirements:

  • Proven experience in deploying and monitoring machine learning models in production environments.
  • 5+ years of experience working with Docker, Kubernetes, Helm, and CI/CD pipelines, including their best practices.
  • 5+ years of experience with observability tools such as Prometheus, Thanos, and Grafana.
  • Familiarity with model monitoring tools such as Arize, Evidently AI, and Alibi Detect.
  • Experience with A/B testing and service mesh software such as Istio.
  • Proficiency in using platforms like Kubeflow and OpenDataHub for model deployment and management.
  • Strong understanding of infrastructure monitoring and observability best practices.
  • Excellent problem-solving skills and the ability to troubleshoot complex issues.
  • Experience with cloud platforms such as AWS, Google Cloud Platform (GCP), or Azure.
  • Knowledge of scripting and automation tools (e.g., Bash, Python).

Responsibilities:

  • Deploy and manage machine learning models on Kubernetes clusters.
  • Develop robust data pipelines for training, inference, and analytics purposes.
  • Monitor and manage cluster infrastructure using Prometheus, Thanos, and other observability tools.
  • Implement and maintain model observability frameworks using tools like Arize.
  • Implement A/B testing strategies, using service mesh software such as Istio, to evaluate model performance and reliability.
  • Develop and maintain dashboards and alerting systems to track model performance and data quality metrics.
  • Collaborate with data scientists and ML engineers to ensure that models are robustly monitored and that issues are quickly identified and resolved.
  • Ensure the scalability and reliability of model deployment pipelines using Kubernetes.
  • Stay up to date with the latest advancements in model observability and infrastructure monitoring technologies.

The job ad is no longer active
