Senior DevOps Engineer

Hi there,

We are looking for a Senior DevOps Engineer to help design, automate, and manage large-scale AI infrastructure.

You will work closely with network engineers, IT teams, and AI/ML specialists to build secure, resilient, and scalable solutions across GPU, CPU, storage, and networking layers.

About the project:
Our client is shaping the future of Artificial Intelligence by providing high-performance, scalable infrastructure. Their mission is to drive innovation across industries by delivering reliable platforms for training and deploying AI models, including LLMs, computer vision, and generative AI.

They design and operate AI-optimized data centers with advanced GPU infrastructure, secure networking, and enterprise-level reliability, helping their customers move seamlessly from AI concepts to full-scale production.

Qualifications:

  • At least 6 years of experience in DevOps, Infrastructure, or Site Reliability Engineering.
  • Expertise in Kubernetes, Docker, Terraform, and infrastructure-as-code (IaC) methodologies.
  • Strong scripting and automation skills in Python, Bash, or Go.
  • Deep knowledge of high-performance networking, including InfiniBand, RoCE, and out-of-band (OOB) management.
  • Hands-on experience with firewall security (FortiGate preferred), routing policies, and network segmentation.
  • Experience in distributed computing, high-performance computing (HPC), or AI-driven environments is a strong plus.

Key Responsibilities:

  • Infrastructure Automation: Design and automate infrastructure provisioning, configuration, and management using Terraform, Ansible, or similar tools.
  • Containerized Environments: Develop and maintain high-availability AI workloads using Docker and Kubernetes.
  • GPU/CPU Management: Automate provisioning, backup, and recovery processes for GPU clusters and CPU infrastructure.
  • Security & Networking: Implement and manage network segmentation, IP access controls, and secure API authentication workflows.
  • Firewall & Routing: Collaborate with the Networking team to integrate automated firewall configurations, routing logic, and multi-ISP resilience strategies.
  • Monitoring & Optimization: Monitor system performance, set up alerts for anomalies in GPU, CPU, and storage usage, and optimize resource utilization.
  • API & Automation: Write and manage internal API calls and integrations to automate access, provisioning, and scaling operations.
  • Remote Access & OOB Management: Support the setup of out-of-band management systems and remote access controls for data center infrastructure.
  • Disaster Recovery: Contribute to disaster recovery planning, implement automated rollback testing, and validate failover strategies.
  • System Monitoring Tools: Utilize tools like IPMI and Redfish to automate system monitoring and data collection.
  • Documentation & Best Practices: Maintain clear, comprehensive documentation for workflows, procedures, and infrastructure design.

What we offer:

  • Flexible Work Environment: Opportunity to work remotely or in our safe office in Kyiv.
  • Premium Medical Insurance: Comprehensive health insurance to ensure your well-being.
  • 1:1 English Classes: Individual English language training to enhance your communication skills.
  • Great Team: Work with a supportive, collaborative, and dynamic international team.
  • Equipment Provided: All necessary equipment supplied for efficient job performance.
  • Annual Vacation: 18 days of paid vacation and 7 days of paid sick leave.
  • Commitment to Hiring Ukrainians: We are dedicated to hiring Ukrainian talent and promoting Ukraine as a fantastic place to work.
  • Flexible Payment System: Withdraw funds in one click, with about twenty withdrawal options to choose from.
Published 7 April