AI Stack Engineer
WE ARE: StartupSoft connects top Ukrainian engineers with world-class startups in Silicon Valley and the EU. Our developers work directly on the product as an integral part of the startup team.
PROJECT: This cloud-based solution simplifies running AI workloads—like training, fine-tuning, and inference—by automatically selecting the most efficient infrastructure based on speed, cost, and reliability. It supports a wide range of models, including large language models (LLMs), computer vision, and retrieval-augmented generation (RAG), with deployment options across cloud, on-premises, or hybrid environments.
With Workload as a Service (WaaS), it removes the complexity of managing infrastructure, ensuring AI workloads run smoothly and without constraints—anywhere, anytime.
TEAM: Most of the engineering team is in Europe.
PROJECT STAGE: Startup (2023).
PROJECT STACK: PyTorch, HPC, AWS, CI/CD
REQUIREMENTS:
- 5+ years of experience in software engineering, with a focus on runtime systems or performance optimization for large-scale distributed systems.
- Strong expertise in low-level performance optimizations and systems programming (C/C++, Go, etc.), with Python experience preferred.
- A Bachelor’s or Master’s degree in Computer Science, Software Engineering, or a related technical field.
- Proven experience working with distributed systems and the ability to optimize runtime environments for AI workloads.
- Excellent problem-solving skills with the ability to innovate and think outside the box in a fast-paced, evolving environment.
- Strong communication and collaboration skills, comfortable working cross-functionally with engineering, research, and product teams.
NICE TO HAVE:
- Experience working with PyTorch or similar deep learning frameworks.
- Familiarity with AI model training pipelines, model deployment processes, or high-performance computing (HPC) environments.
- Experience working in start-up environments or high-growth tech companies with an entrepreneurial mindset.
RESPONSIBILITIES:
Improve LLM Training Reliability and Elasticity:
- Design and implement solutions to make PyTorch more elastic and resilient, enabling fault tolerance and dynamic scaling of training jobs.
- Collaborate with teams to enhance PyTorch functionalities and reduce training downtime, optimizing large language model (LLM) training workflows.
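To make the fault-tolerance responsibility above concrete, here is a minimal sketch of the checkpoint-and-resume pattern that underpins resilient training: every step is checkpointed atomically, so a crashed job restarts from its last completed step instead of from scratch. All names (`train_step`, `run`, the JSON checkpoint format) are illustrative assumptions, not the project's actual stack, and real LLM training would checkpoint model/optimizer state via PyTorch rather than a counter.

```python
import json
import os
import tempfile

def train_step(state):
    # Stand-in for one optimizer step: just advance a counter.
    state["step"] += 1
    return state

def save_ckpt(path, state):
    # Write to a temp file, then rename: a crash mid-write can never
    # leave a corrupt checkpoint behind (os.replace is atomic).
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_ckpt(path):
    # Resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0}

def run(path, total_steps, crash_at=None):
    state = load_ckpt(path)
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state = train_step(state)
        save_ckpt(path, state)
    return state

if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
    try:
        run(ckpt, total_steps=10, crash_at=6)  # first attempt dies mid-run
    except RuntimeError:
        pass
    state = run(ckpt, total_steps=10)  # restart resumes from step 6
    print(state["step"])  # -> 10
```

In an elastic setup, the same idea lets a job continue after workers join or leave: the rendezvous layer restarts the step loop, and recovery cost is bounded by the checkpoint interval rather than the full run.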
Optimize Model Packaging and Production Runtime:
- Ensure seamless integration of customer models with our custom PyTorch stack.
- Manage and improve the production runner, focusing on performance, scalability, and deployment processes to ensure efficient model execution.
Develop and Maintain Internal Tools for Model Training:
- Contribute to the development of internal libraries and tools to improve the training process, including implementing asynchronous operations and fault recovery mechanisms.
- Maintain code quality, enforce best practices, and ensure continuous integration for production readiness.
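As a sketch of the "asynchronous operations" mentioned above, the snippet below shows asynchronous checkpoint writing: the training loop hands a snapshot to a background writer thread and continues immediately, so step time is not blocked on I/O. The `AsyncWriter` class and its in-memory `written` list are hypothetical stand-ins for the project's internal tooling and durable storage.

```python
import queue
import threading

class AsyncWriter:
    """Background thread that persists snapshots off the training path."""

    def __init__(self):
        self._q = queue.Queue()
        self.written = []  # stands in for files on durable storage
        self._t = threading.Thread(target=self._drain, daemon=True)
        self._t.start()

    def _drain(self):
        while True:
            snap = self._q.get()
            if snap is None:  # sentinel: stop draining
                self._q.task_done()
                return
            self.written.append(snap)
            self._q.task_done()

    def submit(self, state):
        # Copy the state so later in-place training updates do not
        # mutate a snapshot that is still waiting to be written.
        self._q.put(dict(state))

    def close(self):
        # Flush everything queued, then shut the writer thread down.
        self._q.put(None)
        self._q.join()
        self._t.join()

if __name__ == "__main__":
    writer = AsyncWriter()
    state = {"step": 0}
    for _ in range(5):
        state["step"] += 1
        writer.submit(state)  # returns immediately; write happens off-thread
    writer.close()
    print(len(writer.written), writer.written[-1]["step"])  # -> 5 5
```

The copy-on-submit detail is the usual correctness trap in this pattern: without it, the writer can persist a half-updated state from a later step.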