AI Stack Engineer
WE ARE: StartupSoft connects top Ukrainian engineers with world-class startups in Silicon Valley and the EU. Our developers work directly on the product as an integral part of the startup team.
PROJECT: This cloud-based solution simplifies running AI workloads—like training, fine-tuning, and inference—by automatically selecting the most efficient infrastructure based on speed, cost, and reliability. It supports a wide range of models, including large language models (LLMs), computer vision, and retrieval-augmented generation (RAG), with deployment options across cloud, on-premises, or hybrid environments.
With Workload as a Service (WaaS), it removes the complexity of managing infrastructure, ensuring AI workloads run smoothly and without constraints—anywhere, anytime.
TEAM: Most of the engineering team is in Europe.
PROJECT STAGE: Startup (2023).
PROJECT STACK: PyTorch, HPC, AWS, CI/CD
REQUIREMENTS:
- 5+ years of experience in software engineering, with a focus on runtime systems or performance optimization for large-scale distributed systems.
- Strong expertise in low-level performance optimizations and systems programming (C/C++, Go, etc.), with Python experience preferred.
- A Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field.
- Proven experience working with distributed systems and the ability to optimize runtime environments for AI workloads.
- Excellent problem-solving skills with the ability to innovate and think outside the box in a fast-paced, evolving environment.
- Strong communication and collaboration skills, comfortable working cross-functionally with engineering, research, and product teams.
NICE TO HAVE:
- Experience working with PyTorch or similar deep learning frameworks.
- Familiarity with AI model training pipelines, model deployment processes, or high-performance computing (HPC) environments.
- Experience working in start-up environments or high-growth tech companies with an entrepreneurial mindset.
RESPONSIBILITIES:
Improve LLM Training Reliability and Elasticity:
- Design and implement solutions to make PyTorch more elastic and resilient, enabling fault tolerance and dynamic scaling of training jobs.
- Collaborate with teams to enhance PyTorch functionalities and reduce training downtime, optimizing large language model (LLM) training workflows.
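To make the fault-tolerance responsibility above concrete, here is a minimal sketch of the checkpoint-and-resume pattern that underpins resilient training: every step is checkpointed atomically, so a crashed job restarts from its last completed step instead of from scratch. All names (`train_step`, `run`, the JSON checkpoint format) are illustrative assumptions, not the project's actual stack, and real LLM training would checkpoint model/optimizer state via PyTorch rather than a counter.

```python
import json
import os
import tempfile

def train_step(state):
    # Stand-in for one optimizer step: just advance a counter.
    state["step"] += 1
    return state

def save_ckpt(path, state):
    # Write to a temp file, then rename: a crash mid-write can never
    # leave a corrupt checkpoint behind (os.replace is atomic).
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_ckpt(path):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

def run(path, total_steps, crash_at=None):
    state = load_ckpt(path)
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state = train_step(state)
        save_ckpt(path, state)
    return state

if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        run(ckpt, total_steps=10, crash_at=6)  # first attempt dies mid-run
    except RuntimeError:
        pass
    state = run(ckpt, total_steps=10)  # restart resumes from step 6
    print(state["step"])  # -> 10
```

In an elastic setup, the same idea lets a job continue after workers join or leave: the rendezvous layer restarts the step loop, and recovery cost is bounded by the checkpoint interval rather than the full run.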
Optimize Model Packaging and Production Runtime:
- Ensure seamless integration of customer models with our custom PyTorch stack.
- Manage and improve the production runner, focusing on performance, scalability, and deployment processes to ensure efficient model execution.
Develop and Maintain Internal Tools for Model Training:
- Contribute to the development of internal libraries and tools to improve the training process, including implementing asynchronous operations and fault recovery mechanisms.
- Maintain code quality, enforce best practices, and ensure continuous integration for production readiness.
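As a sketch of the "asynchronous operations" mentioned above, the snippet below shows asynchronous checkpoint writing: the training loop hands a snapshot to a background writer thread and continues immediately, so step time is not blocked on I/O. The `AsyncWriter` class and its in-memory `written` list are hypothetical stand-ins for the project's internal tooling and durable storage.

```python
import queue
import threading

class AsyncWriter:
    """Background thread that persists snapshots off the training path."""

    def __init__(self):
        self._q = queue.Queue()
        self.written = []  # stands in for files on durable storage
        self._t = threading.Thread(target=self._drain, daemon=True)
        self._t.start()

    def _drain(self):
        while True:
            snap = self._q.get()
            if snap is None:  # sentinel: stop draining
                self._q.task_done()
                return
            self.written.append(snap)
            self._q.task_done()

    def submit(self, state):
        # Copy the state so later in-place training updates do not
        # mutate a snapshot that is still waiting to be written.
        self._q.put(dict(state))

    def close(self):
        # Flush everything queued, then shut the writer thread down.
        self._q.put(None)
        self._q.join()
        self._t.join()

if __name__ == "__main__":
    writer = AsyncWriter()
    state = {"step": 0}
    for _ in range(5):
        state["step"] += 1
        writer.submit(state)  # returns immediately; write happens off-thread
    writer.close()
    print(len(writer.written), writer.written[-1]["step"])  # -> 5 5
```

The copy-on-submit detail is the usual correctness trap in this pattern: without it, the writer can persist a half-updated state from a later step.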