AI Stack Engineer

WE ARE: StartupSoft connects top Ukrainian engineers with world-class startups in Silicon Valley and the EU. Our developers work directly on the product as an integral part of the startup team.

PROJECT: This cloud-based solution simplifies running AI workloads—like training, fine-tuning, and inference—by automatically selecting the most efficient infrastructure based on speed, cost, and reliability. It supports a wide range of models, including large language models (LLMs), computer vision, and retrieval-augmented generation (RAG), with deployment options across cloud, on-premises, or hybrid environments.
With Workload as a Service (WaaS), it removes the complexity of managing infrastructure, ensuring AI workloads run smoothly and without constraints—anywhere, anytime.

TEAM: Most of the engineering team is based in Europe.

PROJECT STAGE: Startup (founded in 2023).

PROJECT STACK: PyTorch, HPC, AWS, CI/CD

REQUIREMENTS:

  • 5+ years of experience in software engineering, with a focus on runtime systems or performance optimization for large-scale distributed systems.
  • Strong expertise in low-level performance optimizations and systems programming (C/C++, Go, etc.), with Python experience preferred.
  • A Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field.
  • Proven experience with distributed systems and the ability to optimize runtime environments for AI workloads.
  • Excellent problem-solving skills with the ability to innovate and think outside the box in a fast-paced, evolving environment.
  • Strong communication and collaboration skills, comfortable working cross-functionally with engineering, research, and product teams.

NICE TO HAVE:

  • Experience working with PyTorch or similar deep learning frameworks.
  • Familiarity with AI model training pipelines, model deployment processes, or high-performance computing (HPC) environments.
  • Experience working in start-up environments or high-growth tech companies with an entrepreneurial mindset.

RESPONSIBILITIES:

Improve LLM Training Reliability and Elasticity:

  • Design and implement solutions to make PyTorch more elastic and resilient, enabling fault tolerance and dynamic scaling of training jobs.
  • Collaborate with teams to enhance PyTorch functionalities and reduce training downtime, optimizing large language model (LLM) training workflows.
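To make the fault-tolerance responsibility above concrete, here is a minimal, framework-agnostic sketch of the core pattern behind elastic training: checkpoint after every epoch so a killed job can restart and resume instead of retraining from scratch. All names (CKPT, train, load_checkpoint) are illustrative, not the project's actual code, and plain pickle stands in for a real PyTorch state dict.

```python
# Illustrative sketch: a training loop that checkpoints every epoch and
# resumes after a simulated failure. Not the project's real code.
import os
import pickle
import tempfile

CKPT = os.path.join(tempfile.gettempdir(), "demo_ckpt.pkl")

def load_checkpoint():
    """Return (epoch, state) from disk, or a fresh state if none exists."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return 0, {"loss": None}

def save_checkpoint(epoch, state):
    # Write-then-rename so a crash mid-write never leaves a torn checkpoint.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump((epoch, state), f)
    os.replace(tmp, CKPT)

def train(total_epochs, fail_at=None):
    epoch, state = load_checkpoint()          # resume where the last run stopped
    while epoch < total_epochs:
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated node failure")
        state["loss"] = 1.0 / (epoch + 1)     # stand-in for a real training step
        epoch += 1
        save_checkpoint(epoch, state)
    return epoch, state

if os.path.exists(CKPT):
    os.remove(CKPT)
try:
    train(5, fail_at=3)                       # first run dies at epoch 3
except RuntimeError:
    pass
epoch, _ = train(5)                           # restart resumes from the checkpoint
print(epoch)                                  # → 5
```

In a real PyTorch stack, the same resume-from-checkpoint loop is what an elastic launcher (e.g. torchrun) wraps around each worker restart.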

Optimize Model Packaging and Production Runtime:

  • Ensure seamless integration of customer models with our custom PyTorch stack.
  • Manage and improve the production runner, focusing on performance, scalability, and deployment processes to ensure efficient model execution.
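As a rough sketch of what a "production runner" can look like, the following registry routes inference requests to packaged models by name. The class and method names (ModelRunner, register, predict) are hypothetical, not the project's real API; real packaged models would be loaded artifacts rather than lambdas.

```python
# Hypothetical production-runner sketch: a registry of packaged models that
# routes inference requests by model name. Names are illustrative only.
from typing import Any, Callable, Dict

class ModelRunner:
    def __init__(self) -> None:
        self._models: Dict[str, Callable[[Any], Any]] = {}

    def register(self, name: str, model: Callable[[Any], Any]) -> None:
        """Attach a packaged model (any callable) under a stable name."""
        self._models[name] = model

    def predict(self, name: str, payload: Any) -> Any:
        if name not in self._models:
            raise KeyError(f"model {name!r} is not deployed")
        return self._models[name](payload)

runner = ModelRunner()
runner.register("echo", lambda x: x)
runner.register("double", lambda x: 2 * x)
print(runner.predict("double", 21))   # → 42
```

Keeping deployment (register) separate from execution (predict) is what lets the runner swap or scale individual models without touching callers.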

Develop and Maintain Internal Tools for Model Training:

  • Contribute to the development of internal libraries and tools to improve the training process, including implementing asynchronous operations and fault recovery mechanisms.
  • Maintain code quality, enforce best practices, and ensure continuous integration for production readiness.
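One asynchronous-operation pattern the internal-tools bullet alludes to is off-thread checkpointing: training hands a snapshot to a background writer and keeps going instead of blocking on disk I/O. The sketch below is an assumed design (AsyncCheckpointer is not the project's real tool), using only stdlib threading and an atomic rename.

```python
# Assumed design sketch: asynchronous checkpointing via a background writer
# thread, so the training loop never blocks on disk I/O.
import json
import os
import queue
import tempfile
import threading

class AsyncCheckpointer:
    def __init__(self, path: str) -> None:
        self.path = path
        self._q: "queue.Queue" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self) -> None:
        while True:
            snapshot = self._q.get()
            if snapshot is None:              # shutdown sentinel
                break
            tmp = self.path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(snapshot, f)
            os.replace(tmp, self.path)        # atomic swap, no torn checkpoints

    def save(self, snapshot: dict) -> None:
        self._q.put(dict(snapshot))           # copy so training can keep mutating

    def close(self) -> None:
        self._q.put(None)
        self._worker.join()

ckpt_path = os.path.join(tempfile.gettempdir(), "async_ckpt.json")
ckpt = AsyncCheckpointer(ckpt_path)
state = {"step": 0}
for step in range(1, 4):
    state["step"] = step       # stand-in for a real training step
    ckpt.save(state)           # returns immediately; write happens off-thread
ckpt.close()
with open(ckpt_path) as f:
    print(json.load(f)["step"])   # → 3
```

Copying the snapshot before enqueueing is the key detail: it decouples the writer from the live training state, the same reason real async checkpointers snapshot tensors before serializing.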

The job ad is no longer active
