Senior Site Reliability Engineer (GPU Clusters) - Hosting

1711180
  • Circa $250,000 base salary
  • San Francisco, California, United States
  • Permanent
  • Artificial Intelligence
  • AI Network
  • AI Software


Looking for a role with plenty of growth opportunities?

Join a rapidly scaling AI cloud infrastructure provider building a next-generation GPU platform for AI training, experimentation, and inference at scale. The company is developing a fully featured AI cloud platform powered by renewable energy. It already operates across Europe and is now significantly expanding its footprint in the United States.

The company is looking for a Senior / Staff Site Reliability Engineer to support and grow the large-scale HPC and cloud environments that power GPU-intensive workloads. The role involves working closely with platform, ML, and infrastructure teams to improve reliability, automation, and observability across distributed compute environments, while supporting long-term infrastructure growth and scalability.

Don’t miss out on this exciting opportunity. Apply today!


Responsibilities:

  • Ensure the reliability, scalability, and performance of HPC and cloud infrastructure environments
  • Design, build, and maintain automation, observability, and monitoring frameworks for GPU compute clusters
  • Collaborate with ML, data, and platform engineering teams to deliver highly available infrastructure systems
  • Improve CI/CD pipelines, deployment workflows, and operational tooling
  • Contribute to infrastructure architecture discussions and long-term platform strategy
  • Diagnose performance bottlenecks across distributed systems and HPC workloads
  • Support and optimize Slurm-based GPU cluster environments
  • Participate in an on-call rotation supporting mission-critical infrastructure operations


Skills/Must Have:

  • Deep experience in Site Reliability Engineering, DevOps, Infrastructure Engineering, or related fields
  • Strong experience supporting HPC or large-scale distributed compute environments
  • Deep Linux expertise (Ubuntu/Debian preferred)
  • Strong scripting and automation skills using Python, Go, or Bash
  • Hands-on experience with public cloud platforms or modern GPU cloud providers
  • Strong understanding of networking fundamentals (DNS, TCP/IP, routing, performance optimization)
  • Experience with Infrastructure-as-Code tooling such as Terraform and Ansible
  • Proven experience operating Slurm-based GPU/HPC clusters
  • Ability to troubleshoot distributed systems and optimize workload scheduling/performance


Benefits:

  • Stock options
  • Bonus 
  • Remote working option and allowance 


Salary:

  • Circa $250,000 base salary

Ben Davies, Director Global AI Infrastructure