The company is looking for a Senior / Staff Site Reliability Engineer to support and scale large HPC and cloud environments that power GPU-intensive workloads. The role involves working closely with platform, ML, and infrastructure teams to improve reliability, automation, and observability across distributed compute environments while supporting long-term infrastructure growth.

Don’t miss out on this exciting opportunity and apply today!

Responsibilities:

Ensure the reliability, scalability, and performance of HPC and cloud infrastructure environments
Design, build, and maintain automation, observability, and monitoring frameworks for GPU compute clusters
Collaborate with ML, data, and platform engineering teams to deliver highly available infrastructure systems
Improve CI/CD pipelines, deployment workflows, and operational tooling
Contribute to infrastructure architecture discussions and long-term platform strategy
Diagnose performance bottlenecks across distributed systems and HPC workloads
Support and optimize Slurm-based GPU cluster environments
Participate in an on-call rotation supporting mission-critical infrastructure operations

Skills/Must Have:

Deep experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related fields
Strong experience supporting HPC or large-scale distributed compute environments
Deep Linux expertise (Ubuntu/Debian preferred)
Strong scripting and automation skills using Python, Go, or Bash
Hands-on experience with public cloud platforms or modern GPU cloud providers
Strong understanding of networking fundamentals (DNS, TCP/IP, routing, performance optimization)
Experience with Infrastructure-as-Code tooling such as Terraform and Ansible
Proven experience operating Slurm-based GPU/HPC clusters
Ability to troubleshoot distributed systems and optimize workload scheduling/performance

Benefits:

Stock options
Bonus
Remote working option and allowance

Salary:

Circa $250,000 base salary

Senior Site Reliability Engineer (GPU Clusters) - Hosting
