HPC Systems Engineer
We are seeking an experienced HPC Systems Engineer to design and manage the high-performance computing infrastructure supporting our AI workloads, from deep learning training to large-scale data processing for educational analytics.
Responsibilities
- Architect and maintain HPC clusters and GPU-based compute environments for ML model training.
- Optimize distributed training pipelines (PyTorch DDP, Horovod, or Ray).
- Manage job scheduling systems (Slurm, Kubernetes, or Azure Batch) for efficient workload allocation.
- Ensure scalable data access and high-throughput pipelines between storage and compute nodes.
- Collaborate with AI/ML and MLOps teams to integrate training pipelines with CI/CD workflows.
- Implement security, resource monitoring, and cost optimization practices for compute clusters.
- Automate cluster provisioning and teardown to support on-demand AI workloads.
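As an illustration of the scheduling and distributed-training work described above, a minimal Slurm batch script for a multi-node PyTorch DDP job might look like the following sketch. The partition name, resource counts, and `train.py` entry point are hypothetical placeholders, not details from this posting:

```shell
#!/bin/bash
#SBATCH --job-name=ddp-train       # hypothetical job name
#SBATCH --partition=gpu            # GPU partition (cluster-specific)
#SBATCH --nodes=2                  # two compute nodes
#SBATCH --gpus-per-node=4          # four GPUs per node
#SBATCH --ntasks-per-node=1        # one torchrun launcher per node
#SBATCH --time=04:00:00            # wall-clock limit

# torchrun spawns one training process per GPU on each node;
# the first node in the allocation acts as the rendezvous host.
HEAD_NODE=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)

srun torchrun \
  --nnodes=2 \
  --nproc_per_node=4 \
  --rdzv_backend=c10d \
  --rdzv_endpoint="${HEAD_NODE}:29500" \
  train.py   # placeholder for the actual training script
```

Equivalent allocations can be expressed as Kubernetes pod specs or Azure Batch pool configurations; the scheduler choice mainly affects how GPUs are reserved and how the rendezvous endpoint is discovered.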
Requirements
- 5+ years managing HPC or GPU-based compute infrastructure.
- Proficiency with Linux, Slurm, Kubernetes, Docker, and Terraform.
- Hands-on experience with Azure Batch, AWS ParallelCluster, or on-prem HPC setups.
- Strong understanding of networking, parallel file systems, and GPU performance tuning.
- Familiarity with data pipelines and model training orchestration.
- Excellent scripting skills (Bash, Python).
Required languages
| Language | Level |
| --- | --- |
| English | B1 - Intermediate |
Published 14 November