Site Reliability Engineer (DevOps) $$$$
We are looking for a dedicated SRE to cover European/Eastern European business hours, preferably backed by an incident response service (MSP).
As part of the SRE team (working REMOTELY), you will maintain our AI infrastructure platform outside California (PST) business hours.
What you'll be doing:
1) Ensure reliability and scalability of our AI infrastructure platform and hybrid Linux environments.
2) Manage Linux infrastructure to ensure maximum uptime.
3) Run performance and reliability tests. This may include reviewing configuration, software choices/versions, hardware specs, etc.
4) Advance our technology stack with innovative ideas and new creative solutions.
5) Participate in capacity management of core systems and services, application analysis, and performance and security tuning. Provide operational support for systems and build automation that remediates incidents and addresses their root causes, with the goal of automating the response to all non-exceptional service conditions.
6) Create strategies for long-term, permanent fixes to critical production incidents.
7) Maintain documentation, build tooling, and create alerts to identify and address infrastructure reliability issues.
8) Proactively identify system anomalies.
Typical tasks might include moving clusters, general troubleshooting, handling unresponsive hosts, etc.
Technical stack:
Linux, VPNs, Cloud (AWS, Google, GPU), Kubernetes, Docker
We use Java (backend) and React (frontend) on our platform.
Requirements:
1) Availability during CET/Eastern European business hours (with some overlap with PST morning hours for daily reports)
2) Knowledge of best practices for architecting cross-datacenter Kubernetes clusters running on-premises with automated etcd management
3) Deep knowledge of Docker (dockershim), containerd, and runc internals at the kernel level
4) Ability to manually troubleshoot and solve certificate issues within Kubernetes with zero downtime
5) Thorough understanding of RPM-based Linux systems
6) Understanding of basic networking concepts (TCP/IP stack, DNS, CDN, load balancing, BGP)
7) Ability to resolve complex merge conflicts in Git (an obvious requirement)
Required languages
| English | B2 - Upper Intermediate |