Systems Engineer (HPC), up to $4000
We are seeking a highly skilled Systems Engineer specializing in High Performance Computing (HPC) to support, maintain, and optimize our HPC infrastructure. The ideal candidate has deep technical expertise, hands-on experience with HPC environments, and a strong understanding of performance engineering, systems operations, and automation.
Project Start: ASAP
Project Duration: Until December 2026
Location: Remote (with on‑site onboarding in Cologne)
English: Fluent
German: a plus
Responsibilities
Incident & Service Operations
Incident Management: Respond to, diagnose, and resolve HPC-related incidents to ensure system stability and minimize downtime.
Service Request Management: Process and fulfill service requests related to HPC resources, tooling, and services.
Technical Tasks
Troubleshooting: Investigate and resolve complex technical issues across HPC clusters, applications, networking, and performance workflows.
Testing & Validation: Develop, execute, and document test plans to validate system reliability, scalability, and performance.
Documentation: Create and maintain detailed documentation on system architecture, configurations, workflows, and optimizations.
Manage, monitor, and optimize HPC clusters, job scheduling systems, and related infrastructure.
Analyze performance bottlenecks and apply optimization techniques across compute, memory, and networking layers.
Support software development, integration, and deployment workflows within HPC environments.
Required Qualifications
Minimum 3 years of experience in software development and/or systems engineering with a strong focus on HPC environments.
Expertise in Linux operating systems, specifically Red Hat Enterprise Linux (RHEL).
Strong programming, scripting, and automation skills: C, C++, Python, Bash, Ansible.
Hands-on experience with parallel computing frameworks: MPI, OpenMP, CUDA.
Solid knowledge of computer architecture, performance tuning, and system optimization.
Experience managing HPC clusters, including job schedulers (e.g., Slurm, PBS, LSF).
Strong networking knowledge, particularly InfiniBand.
Understanding of ITIL best practices, especially Incident Management, Service Management, and Process Optimization.
Soft Skills
Strong analytical and problem-solving capabilities
Ability to work in distributed, remote teams
Clear communication and documentation skills
Proactive, structured, and solution-oriented mindset
Required languages
| English | C1 - Advanced |
| German | B2 - Upper Intermediate |