DevOps / SRE / AIOps Engineer – Google Cloud

JUTEQ is an AI-native and cloud-native technology consulting firm helping enterprises in financial services, telecom, and healthcare build intelligent, production-grade systems. We combine the power of GenAI, cloud architecture, and automation to deliver next-generation business tools.

We’re seeking a DevOps/Site Reliability Engineer (SRE) with experience in Google Cloud Platform (GCP) to lead and evolve our AI infrastructure as we scale multi-tenant agentic systems across automotive and enterprise use cases. This is a hands-on role working at the intersection of automation, observability, and production AI system reliability.

What You’ll Work On

Platform Reliability & Automation

  • Own deployment pipelines, autoscaling, and high availability for AI microservices running on GCP (Cloud Run, GKE, App Engine)
  • Design and optimize CI/CD pipelines using Cloud Build, Skaffold, and GitHub Actions
  • Implement intelligent autoscaling strategies based on LLM cost, latency, and throughput
  • Use Infrastructure as Code (Terraform, Deployment Manager) for repeatable cloud provisioning

Monitoring & Observability

  • Deploy monitoring and alerting across Cloud Logging, Cloud Monitoring, and custom dashboards for agent performance metrics
  • Define SLOs and SLIs for key services; implement failover and rollback strategies
  • Build observability into agent workflows: latency, success rate, AI token consumption, prompt drift, etc.

Data & AI Infrastructure

  • Manage access, scaling, and resilience of data services: BigQuery, Firestore, Memorystore, Cloud Storage, Pub/Sub
  • Support model integration workflows with Vertex AI and third-party LLM providers (OpenAI, Anthropic, etc.)
  • Monitor and secure retrieval pipelines (RAG, embedding generation, vector DBs)

Security & Compliance

  • Implement and maintain IAM policies, workload identity, and service-to-service authentication
  • Lead incident response and postmortem analysis for production outages
  • Ensure systems comply with data residency, privacy, and SOC 2/GDPR requirements

What We’re Looking For

Experience & Skills

  • 4+ years of DevOps or SRE experience, including at least 2 years on GCP
  • Strong understanding of GCP products including Cloud Run, GKE, Cloud Build, BigQuery, Pub/Sub, Cloud Monitoring
  • Experience with CI/CD and GitOps workflows (GitHub Actions, ArgoCD, etc.), as well as observability and monitoring tooling
  • Deep knowledge of containerization, Docker, and Kubernetes
  • Familiarity with AI infrastructure (LLMs, prompt evaluation, LangChain/CrewAI patterns) is a strong plus
  • Experience with alerting and logging using Prometheus, Grafana, or GCP-native tools
  • Proficiency in scripting (Python, Bash, Go preferred)

Bonus Points

  • Experience managing infrastructure for AI agent systems or GenAI workloads
  • Familiarity with multi-tenant SaaS platforms
  • Understanding of RAG pipelines, embedding generation, or agent orchestration
  • Certifications: Google Professional Cloud DevOps Engineer or equivalent

Why Join Us

  • Shape the infrastructure behind real-world AI agents used by automotive dealerships and enterprises
  • Work alongside AI developers, product engineers, and solution architects
  • Ship fast in a zero-to-one environment while building for scale
  • Own platform-level impact across reliability, security, cost, and developer productivity

How to Apply

Please send:

  • Your resume highlighting DevOps/SRE experience on GCP
  • GitHub or portfolio links showcasing infrastructure projects or CI/CD pipelines
  • (Optional) A short Loom or other video describing a favorite system you’ve built or scaled

Required languages

English B2 - Upper Intermediate
CI/CD, Docker, Kubernetes, Terraform, DevOps, Prometheus+Grafana, Python, AWS, PostgreSQL
Published 27 August