Senior/Lead GPU Infrastructure Engineer

Akvelon is an international IT company headquartered in the US, with offices in Seattle, Mexico, Ukraine, Poland, and Serbia. We are an official vendor of Microsoft and Google, and our clients include global leaders such as Amazon, Evernote, Intel, HP, Reddit, Pinterest, AT&T, T-Mobile, Starbucks, and LinkedIn.


By joining Akvelon, you become part of strong engineering teams around the world, building modern products, from Enterprise and CRM systems to Cloud, AI/ML, Mobile, and cross-platform solutions.

Here, you work with cutting-edge technologies, contributing to scalable systems used by millions of users.


Project Overview:

We are looking for a Senior / Lead GPU Infrastructure Engineer to join a cutting-edge project focused on building next-generation infrastructure for on-demand GPU access and large-scale compute environments.


The platform enables researchers to efficiently access and manage GPU resources across:

  • High-performance GPU workstations
  • Large-scale GPU clusters
  • Cloud-based environments


You will work on core infrastructure systems, including GPU scheduling, cluster orchestration, automation, and developer environment standardization.


The project starts with an MVP phase, with further development planned based on the product roadmap and business priorities.


Responsibilities:

  • Design and develop infrastructure for GPU workstations and clusters
  • Work with Kubernetes-based environments for orchestration and scheduling
  • Build and improve GPU allocation, reservation, and scheduling systems
  • Automate infrastructure provisioning using Terraform / Ansible
  • Implement GPU sharing and virtualization using NVIDIA technologies (MIG, time-slicing, DCGM)
  • Develop automation scripts (Python / Bash / PowerShell)
  • Build monitoring and observability pipelines (Prometheus, Grafana)
  • Define alerting, SLIs, and incident automation workflows
  • Work with cloud GPU infrastructure (Azure)
  • Integrate CI/CD, GitHub workflows, and developer environments (devcontainers, etc.)
  • Collaborate with distributed engineering teams and stakeholders
  • For Lead-level candidates: provide technical leadership, guide engineering decisions, mentor team members, and coordinate delivery


Requirements:

  • Strong hands-on experience with Kubernetes
  • Experience with GPU scheduling, GPU infrastructure, GPU cluster management, HPC, or ML compute platforms
  • Practical knowledge of the NVIDIA ecosystem, such as MIG, time-slicing, DCGM, GPU telemetry, GPU partitioning, or GPU resource optimization
  • Experience with Infrastructure as Code, preferably Terraform or Ansible
  • Strong scripting skills with Python, PowerShell, Bash, or similar tools
  • Experience working with Linux-based infrastructure and distributed systems
  • Understanding of monitoring, observability, and infrastructure automation
  • Good communication skills and ability to work directly with technical stakeholders
  • English level: Upper-Intermediate or higher


Additional Requirements for Lead Level:

  • Previous team leadership or technical leadership experience
  • Ability to own technical direction and architecture decisions
  • Experience mentoring engineers and coordinating technical delivery
  • Ability to communicate risks, tradeoffs, and implementation plans clearly


Nice to Have:

  • Experience with Slurm or other HPC schedulers
  • Experience with Azure, AKS, Azure-hosted GPUs, or Azure infrastructure
  • Experience with Prometheus, Grafana, or similar monitoring tools
  • Experience with automated alerting, incident management, or IcM-like systems
  • Experience with DevContainers, VS Code integration, or GitHub Codespaces
  • Experience with CI/CD, GitHub Actions, Azure DevOps, or 1ES Hosted Runners
  • Experience building platforms for researchers, ML engineers, or data scientists
  • Knowledge of RBAC, SSO, secure access, and enterprise security requirements
  • Experience with CLI tools, MCP servers, agents, or automation for infrastructure workflows


Working conditions and benefits:

  • Initial contract until the end of the year, with a possible extension depending on MVP outcomes and the project roadmap
  • Paid vacation and sick leave (no medical certificate required)
  • Official state holidays: 11 public holidays per year
  • Professional growth through challenging projects and the opportunity to master new technologies
  • Personal Career Development Plan (CDP)
  • Employee support programs (discounts, healthcare, legal assistance)
  • Paid external training, conferences, and professional certifications aligned with business goals
  • Internal workshops and seminars


Ready to work on cutting-edge AI infrastructure? Apply now.

Required languages

English B2 - Upper Intermediate
Kubernetes, GPU infrastructure, NVIDIA ecosystem, MIG, DCGM, Infrastructure as Code, Terraform, Python, Slurm, Azure
Published 28 April