AI Tech Lead / Senior Developer (MLOps Architect / ML Infrastructure Lead)
As the AI Expert in the CTO’s office, you will be the technical owner of everything model-powered inside the company:
- Architecture – Design the end-to-end pipeline that ingests org context, routes to the right expert model, executes code in sandboxed containers, and feeds rich telemetry back into our continuous-learning loop.
- Model Strategy – Decide when we fine-tune open-source Llama-3 vs. hot-swap to Bedrock or Vertex; benchmark MoE routers for latency and cost; champion vLLM/Triton for GPU efficiency.
- MLOps at Scale – Own versioning, lineage, policy gating and rollback of models and in-line tools. Ship deterministic, reproducible releases that DevSecOps trusts.
- Tooling & Integrations – Work with backend and platform leads to expose new model endpoints through our Model Context Protocol (MCP) so agents can compose actions across GitHub, Jira, Terraform, Prometheus and more — without one-off plugins.
- Thought Leadership – Partner with the CTO on the technical roadmap, publish internal RFCs, mentor engineers and evangelize best practices across the company and open-source community.
What You’ll Do
- Craft cloud-native microservice architectures for training, fine-tuning and real-time inference (AWS/GCP/Azure, Kubernetes, JetStream).
- Define SLOs for p95 agent latency, model success rate, and telemetry coverage; instrument with OTEL, Prometheus and custom reward models.
- Drive our continuous-learning loop: reward modelling, ContextGraph enrichment, auto-tuning MoE routers.
- Embed least-privilege IAM and OPA/ABAC policy checks into every stage of the model lifecycle.
- Collaborate with product managers to translate customer pain into roadmap items and with design partners to validate solutions in production.
- Mentor a cross-functional squad of backend engineers, ML engineers and data scientists.
What You'll Bring
- 5+ years in software engineering, with 3+ years architecting large-scale backend systems (Python, Go, Java or similar).
- 4+ years designing, deploying and monitoring AI/ML systems in production.
- Deep expertise in at least one of: large-language-model serving, MoE routing, RLHF, vector search, streaming inference.
- Hands-on fluency with Kubernetes, Docker, CI/CD, IaC (Terraform/Helm) and distributed data technologies (Kafka, Spark, Arrow).
- Proven MLOps track record (MLflow, Kubeflow, SageMaker, or similar) and a security-first mindset.
- Ability to turn ambiguous business goals into a crisp, scalable architecture — and to communicate that vision to both executives and engineers.
- Excellent English communication skills.
Nice-to-Haves
- PhD or publications in ML/NLP/Systems.
- Contributions to open-source LLM or MLOps projects.
- Experience pushing real-time inference to the edge or FPGA/ASIC accelerators.
- Prior leadership of cross-functional AI/ML teams in a fast-growing startup environment.
The Way We Work
We value clarity, ownership, and velocity. You’ll have direct access to the CTO, autonomy to choose the right tech, and a front-row seat as we redefine how enterprises move “from prompt to production.”
If building the Kubernetes of AI-driven operations excites you, let’s talk.
Published 28 May