HPC Systems Engineer
We are seeking an experienced HPC Systems Engineer to design and manage the high-performance computing infrastructure supporting our AI workloads, from deep learning training to large-scale data processing for educational analytics.
Responsibilities
- Architect and maintain HPC clusters and GPU-based compute environments for ML model training.
- Optimize distributed training pipelines (PyTorch DDP, Horovod, or Ray).
- Manage job scheduling systems (Slurm, Kubernetes, or Azure Batch) for efficient workload allocation.
- Ensure scalable data access and high-throughput pipelines between storage and compute nodes.
- Collaborate with AI/ML and MLOps teams to integrate training pipelines with CI/CD workflows.
- Implement security, resource monitoring, and cost optimization practices for compute clusters.
- Automate cluster provisioning and teardown to support on-demand AI workloads.
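As an illustration of the scheduling and distributed-training work described above, a minimal Slurm batch script for a multi-node PyTorch DDP job might look like the following sketch. The partition name, resource counts, and `train.py` entry point are hypothetical placeholders, not details from this posting:

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train       # hypothetical job name
#SBATCH --partition=gpu            # GPU partition (cluster-specific)
#SBATCH --nodes=2                  # two compute nodes
#SBATCH --gpus-per-node=4          # four GPUs per node
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node
#SBATCH --time=04:00:00            # wall-clock limit

# torchrun spawns one training process per GPU on each node;
# the first node in the allocation acts as the rendezvous host.
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

srun torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${HEAD_NODE}:29500" \
  train.py   # placeholder for the actual training script
```

Equivalent allocations can be expressed as Kubernetes pod specs or Azure Batch pool configurations; the scheduler choice mainly affects how GPUs are reserved and how the rendezvous endpoint is discovered.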
Requirements
- 5+ years managing HPC or GPU-based compute infrastructure.
- Proficiency with Linux, Slurm, Kubernetes, Docker, and Terraform.
- Hands-on experience with Azure Batch, AWS ParallelCluster, or on-prem HPC setups.
- Strong understanding of networking, parallel file systems, and GPU performance tuning.
- Familiarity with data pipelines and model training orchestration.
- Excellent scripting skills (Bash, Python).
Required languages
| Language | Level |
| --- | --- |
| English | B1 - Intermediate |
Published 14 November