DevOps / SRE / AIOps Engineer – Google Cloud
JUTEQ is an AI-native and cloud-native technology consulting firm helping enterprises in financial services, telecom, and healthcare build intelligent, production-grade systems. We combine the power of GenAI, cloud architecture, and automation to deliver next-generation business tools.
We’re seeking a DevOps/Site Reliability Engineer (SRE) with experience in Google Cloud Platform (GCP) to lead and evolve our AI infrastructure as we scale multi-tenant agentic systems across automotive and enterprise use cases. This is a hands-on role working at the intersection of automation, observability, and production AI system reliability.
What You’ll Work On
Platform Reliability & Automation
- Own deployment pipelines, autoscaling, and high availability for AI microservices running on GCP (Cloud Run, GKE, App Engine)
- Design and optimize CI/CD pipelines using Cloud Build, Skaffold, and GitHub Actions
- Implement intelligent autoscaling strategies based on LLM cost, latency, and throughput
- Use Infrastructure as Code (Terraform, Deployment Manager) for repeatable cloud provisioning
Monitoring & Observability
- Deploy monitoring and alerting across Cloud Logging, Cloud Monitoring, and custom dashboards for agent performance metrics
- Define SLOs and SLIs for key services; implement failover and rollback strategies
- Build observability into agent workflows: latency, success rate, AI token consumption, prompt drift, etc.
Data & AI Infrastructure
- Manage access, scaling, and resilience of data services: BigQuery, Firestore, Memorystore, Cloud Storage, Pub/Sub
- Support model integration workflows with Vertex AI and third-party LLM providers (OpenAI, Anthropic, etc.)
- Monitor and secure retrieval pipelines (RAG, embedding generation, vector DBs)
Security & Compliance
- Implement and maintain IAM policies, workload identity, and service-to-service authentication
- Lead incident response and postmortem analysis for production outages
- Ensure systems comply with data residency, privacy, SOC 2, and GDPR requirements
What We’re Looking For
Experience & Skills
- 4+ years of DevOps or SRE experience, including at least 2 years on GCP
- Strong understanding of GCP products including Cloud Run, GKE, Cloud Build, BigQuery, Pub/Sub, Cloud Monitoring
- Experience with CI/CD and GitOps workflows (GitHub Actions, ArgoCD, etc.) and with observability and monitoring practices
- Deep knowledge of containerization, Docker, and Kubernetes
- Familiarity with AI infrastructure (LLMs, prompt evaluation, LangChain/CrewAI patterns) is a strong plus
- Experience with alerting and logging using Prometheus, Grafana, or GCP-native tools
- Proficient in scripting (Python, Bash, Go preferred)
Bonus Points
- Experience managing infrastructure for AI agent systems or GenAI workloads
- Familiarity with multi-tenant SaaS platforms
- Understanding of RAG pipelines, embedding generation, or agent orchestration
- Certifications: Google Professional Cloud DevOps Engineer or equivalent
Why Join Us
- Shape the infrastructure behind real-world AI agents used by automotive dealerships and enterprises
- Work alongside AI developers, product engineers, and solution architects
- Ship fast in a zero-to-one environment while building for scale
- Own platform-level impact across reliability, security, cost, and developer productivity
How to Apply
Please send:
- Your resume highlighting DevOps/SRE experience on GCP
- GitHub or portfolio links showcasing infrastructure projects or CI/CD pipelines
- (Optional) A short Loom or other video describing a favorite system you’ve built or scaled
Required Languages
- English: B2 (Upper Intermediate)