Site Reliability Engineer (DevOps) $$$$

We are looking for a dedicated SRE to cover European/Eastern European business hours, preferably coupled with an incident response service (MSP).

 

As part of the SRE team (working REMOTELY), you will be responsible for maintaining our AI infrastructure platform outside of California (PST) business hours.
 

What you'll be doing:

 

1) Ensure reliability and scalability of our AI infrastructure platform and hybrid Linux environments.

 

2) Manage Linux infrastructure to ensure maximum uptime.

 

3) Carry out performance and reliability testing. This may include reviewing configuration, software choices/versions, hardware specs, etc.

 

4) Advance our technology stack with innovative ideas and creative new solutions.

 

5) Participate in capacity management of core systems and services, application analysis, and performance and security tuning. Provide operational support for systems and build automation that remediates issues and addresses their root causes, with the goal of automating the response to all non-exceptional service conditions.

 

6) Create strategies for long-term, permanent fixes to critical production incidents.

 

7) Maintain documentation, build tooling, and create alerts to both identify and address infrastructure reliability issues.

 

8) Proactively identify system anomalies.

 

Typical tasks might include moving clusters, troubleshooting, resolving unresponsive-host issues, etc.

 

Technical stack:

 

Linux, VPNs, Cloud (AWS, Google, GPU), Kubernetes, Docker

We use Java (back end) and React (front end) on our platform.

 

Requirements:

 

1) Availability in CET/Eastern European time (with some overlap with PST morning hours for daily reports)

 

2) Knowledge of best practices for architecting cross-datacenter Kubernetes clusters running on-premises with automated etcd management

 

3) Profound knowledge of Docker (dockershim), containerd, and runc internals at the kernel level

 

4) Ability to manually troubleshoot and solve certificate issues within Kubernetes with zero downtime

 

5) Thorough understanding of RPM-based Linux systems.

 

6) Understanding of basic networking concepts (TCP/IP stack, DNS, CDN, load balancing, BGP).

 

7) Ability to resolve complex merge conflicts in Git is an obvious requirement.

Required languages

English B2 - Upper Intermediate
Published 25 March