HPC Systems Engineer

We are seeking an experienced HPC Systems Engineer to design and manage the high-performance computing infrastructure supporting our AI workloads: from deep learning training to large-scale data processing for educational analytics.

Responsibilities

  • Architect and maintain HPC clusters and GPU-based compute environments for ML model training.
  • Optimize distributed training pipelines (PyTorch DDP, Horovod, or Ray).
  • Manage job scheduling systems (Slurm, Kubernetes, or Azure Batch) for efficient workload allocation.
  • Ensure scalable data access and high-throughput pipelines between storage and compute nodes.
  • Collaborate with AI/ML and MLOps teams to integrate training pipelines with CI/CD workflows.
  • Implement security, resource monitoring, and cost optimization practices for compute clusters.
  • Automate cluster provisioning and teardown to support on-demand AI workloads.
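To give a flavor of the day-to-day work behind these responsibilities, here is a minimal Slurm batch script that requests GPUs across two nodes and launches a PyTorch DDP run via `torchrun`. It is an illustrative sketch only: the partition name, GPU counts, rendezvous port, and `train.py` entry point are all assumptions, not part of any specific environment described above.

```bash
#!/bin/bash
#SBATCH --job-name=ddp-train       # hypothetical job name
#SBATCH --nodes=2                  # two-node distributed run
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=1        # one launcher task per node
#SBATCH --time=04:00:00
#SBATCH --partition=gpu            # assumed partition name

# torchrun spawns one worker per GPU and handles cross-node rendezvous;
# the first node in the allocation acts as the rendezvous host.
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
  --nnodes="$SLURM_NNODES" \
  --nproc_per_node=4 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${HEAD_NODE}:29500" \
  train.py
```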

Requirements

  • 5+ years managing HPC or GPU-based compute infrastructure.
  • Proficiency with Linux, Slurm, Kubernetes, Docker, and Terraform.
  • Hands-on experience with Azure Batch, AWS ParallelCluster, or on-prem HPC setups.
  • Strong understanding of networking, parallel file systems, and GPU performance tuning.
  • Familiarity with data pipelines and model training orchestration.
  • Excellent scripting skills (Bash, Python).
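The scripting expected here often amounts to parsing scheduler output for monitoring and cost reporting. As a small illustrative sketch (the pipe-delimited record format and field order are assumptions for the example, not a real `sacct` output contract), the following aggregates GPU-hours per user from accounting records:

```python
from collections import defaultdict

def gpu_hours_per_user(records):
    """Sum GPU-hours per user from pipe-delimited records of the
    hypothetical form: user|gpu_count|elapsed_seconds."""
    totals = defaultdict(float)
    for line in records:
        user, gpus, elapsed = line.strip().split("|")
        totals[user] += int(gpus) * int(elapsed) / 3600.0
    return dict(totals)

records = [
    "alice|4|7200",   # 4 GPUs for 2 h -> 8.0 GPU-hours
    "bob|8|3600",     # 8 GPUs for 1 h -> 8.0 GPU-hours
    "alice|1|1800",   # 1 GPU for 0.5 h -> 0.5 GPU-hours
]
print(gpu_hours_per_user(records))  # {'alice': 8.5, 'bob': 8.0}
```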

Required languages

English B1 - Intermediate
Published 14 November on Djinni