Corvex
We are at the forefront of AI and high-performance computing, leveraging cutting-edge GPU infrastructure to power advanced research.
Network Ops Engineer - High-Performance Computing and Cloud Infrastructure
Full Remote · Worldwide · 10 years of experience · Upper-Intermediate
Position Overview:
We are looking for a Network Engineer to design and optimize a reliable, secure, and scalable high-performance network infrastructure. The ideal candidate will prioritize system reliability, proactively implementing best practices to minimize downtime and ensure consistent performance for mission-critical workloads. A strong commitment to tenant isolation and network security is essential to providing secure multi-tenant environments for our customers.
As a Network Engineer, you will design, deploy, and maintain high-performance network infrastructure for AI/ML applications, data centers, and private GPU clouds. You will work with cutting-edge networking technologies, including RDMA, InfiniBand, RoCE, and cloud-based networking solutions, to ensure low-latency, high-throughput communication across our distributed systems.
Key Responsibilities:
- Design, deploy, and maintain high-speed networking infrastructure for private and cloud-based AI/ML workloads.
- Configure and optimize RDMA/InfiniBand and RoCE networks for high-performance computing (HPC) clusters.
- Manage networking for NVIDIA GPU cloud environments, ensuring low-latency data transfers.
- Implement and manage network security policies, firewalls, and VPNs.
- Automate network management tasks using Python, Ansible, or other scripting tools.
- Monitor, troubleshoot, and optimize network performance using tools like Prometheus, Grafana, and Wireshark.
- Collaborate with DevOps, AI, and cloud infrastructure teams to ensure seamless connectivity across distributed environments.
- Design and manage software-defined networking (SDN) solutions for private and hybrid cloud deployments.
- Ensure compliance with networking best practices, security policies, and industry regulations.
Required Skills & Qualifications:
- Education: Computer Science, Networking, or a related field (or equivalent experience).
- Networking Protocols: Deep understanding of TCP/IP, UDP, MLAG, GRE, BGP, VLANs, VXLANs, and MPLS.
- High-Performance Networking: Experience with RDMA, InfiniBand, Bare Metal, Virtualized Partitioning, and NVIDIA/Mellanox Cumulus networking solutions.
- Network Security: Proficiency in firewalls (pfSense), IDS/IPS, and WireGuard VPN technologies.
- Cloud Networking: Experience with virtual networking and cloud networking solutions.
- Automation & Scripting: Knowledge of scripting languages, Ansible, and Terraform for network automation.
- Monitoring & Troubleshooting: Familiarity with Wireshark, tcpdump, Prometheus, Grafana, and SNMP-based monitoring.
- Infrastructure as Code (IaC): Experience automating network configurations with Terraform and Ansible.
Preferred Qualifications:
- Experience in designing network architectures for AI/ML and high-performance computing (HPC) environments.
- Knowledge of data center networking with large-scale deployments.
- Experience with zero-trust security frameworks.
Soft Skills:
- Ability to adapt and thrive in an ambiguous, fast-moving startup environment.
- Strong problem-solving skills and the ability to diagnose complex networking issues.
- A strong sense of ownership and accountability, with a commitment to end-to-end problem-solving when working with external customers.
- Excellent communication skills, with the ability to work across DevOps, AI, and cloud engineering teams.
- Detail-oriented mindset, ensuring high availability and reliability of network infrastructure.
Benefits:
- Flexible hybrid work arrangements.
- Performance-based bonuses.
- Access to cutting-edge AI/ML infrastructure.