Senior/Lead GPU Infrastructure Engineer

Akvelon is an international IT company headquartered in the US, with offices in Seattle, Mexico, Ukraine, Poland, and Serbia. We are an official vendor of Microsoft and Google, and our clients include global leaders such as Amazon, Evernote, Intel, HP, Reddit, Pinterest, AT&T, T-Mobile, Starbucks, and LinkedIn.

By joining Akvelon, you become part of strong engineering teams around the world, building modern products — from Enterprise and CRM systems to Cloud, AI/ML, Mobile, and cross-platform solutions.

Here, you work with cutting-edge technologies, contributing to scalable systems used by millions of users.

Project Overview:

We are looking for a Senior / Lead GPU Infrastructure Engineer to join a cutting-edge project focused on building next-generation infrastructure for on-demand GPU access and large-scale compute environments.

The platform enables researchers to efficiently access and manage GPU resources across:

High-performance GPU workstations
Large-scale GPU clusters
Cloud-based environments

You will work on core infrastructure systems, including GPU scheduling, cluster orchestration, automation, and developer environment standardization.

The project starts with an MVP phase, with further development planned based on the product roadmap and business priorities.

Responsibilities:

Design and develop infrastructure for GPU workstations and clusters
Work with Kubernetes-based environments for orchestration and scheduling
Build and improve GPU allocation, reservation, and scheduling systems
Automate infrastructure provisioning using Terraform / Ansible
Implement GPU sharing and virtualization using NVIDIA technologies (MIG, time-slicing, DCGM)
Develop automation scripts (Python / Bash / PowerShell)
Build monitoring and observability pipelines (Prometheus, Grafana)
Define alerting, SLIs, and incident automation workflows
Work with cloud GPU infrastructure (Azure)
Integrate CI/CD, GitHub workflows, and developer environments (devcontainers, etc.)
Collaborate with distributed engineering teams and stakeholders
For Lead-level candidates: provide technical leadership, guide engineering decisions, mentor team members, and coordinate delivery

Requirements:

Strong hands-on experience with Kubernetes
Experience with GPU scheduling, GPU infrastructure, GPU cluster management, HPC, or ML compute platforms
Practical knowledge of the NVIDIA ecosystem, such as MIG, time-slicing, DCGM, GPU telemetry, GPU partitioning, or GPU resource optimization
Experience with Infrastructure as Code, preferably Terraform or Ansible
Strong scripting skills with Python, PowerShell, Bash, or similar tools
Experience working with Linux-based infrastructure and distributed systems
Understanding of monitoring, observability, and infrastructure automation
Good communication skills and ability to work directly with technical stakeholders
English level: Upper-Intermediate or higher

Additional Requirements for Lead Level:

Previous team leadership or technical leadership experience
Ability to own technical direction and architecture decisions
Experience mentoring engineers and coordinating technical delivery
Ability to communicate risks, tradeoffs, and implementation plans clearly

Nice to Have:

Experience with Slurm or other HPC schedulers
Experience with Azure, AKS, Azure-hosted GPUs, or Azure infrastructure
Experience with Prometheus, Grafana, or similar monitoring tools
Experience with automated alerting, incident management, or IcM-like systems
Experience with DevContainers, VS Code integration, or GitHub Codespaces
Experience with CI/CD, GitHub Actions, Azure DevOps, or 1ES Hosted Runners
Experience building platforms for researchers, ML engineers, or data scientists
Knowledge of RBAC, SSO, secure access, and enterprise security requirements
Experience with CLI tools, MCP servers, agents, or automation for infrastructure workflows

Working conditions and benefits:

Initial contract until the end of the year, with a possible extension depending on MVP outcomes and the project roadmap
Paid vacation and sick leave (no medical certificate required)
Official state holidays — 11 public holidays per year
Professional growth through challenging projects and the opportunity to master new technologies
Personal Career Development Plan (CDP)
Employee support programs (discounts, healthcare, legal assistance)
Paid external training, conferences, and professional certifications aligned with business goals
Internal workshops and seminars

Ready to work on cutting-edge AI infrastructure? Apply now.

Kubernetes, GPU infrastructure, NVIDIA ecosystem, MIG, DCGM, Infrastructure as Code, Terraform, Python, Slurm, Azure

Published 28 April

Only from 4 years of experience
Full Remote
Countries where we consider candidates: EU
English: B2 - Upper Intermediate

Data Engineer

Employment: Fulltime
Domain: Other
Outsource
