Site Reliability Engineer - Hosting

1677663
  • Up to €130,000 gross per year
  • Remote, Europe
  • Permanent
  • Artificial Intelligence
  • AI Network


Curious about powering next-generation machine learning and AI workloads?

Join a next-generation cloud platform focused on giving AI builders and machine learning teams powerful, production-grade compute resources and infrastructure without the barriers of traditional providers. The organisation delivers on-demand GPU compute, clusters, serverless inference, and scalable environments designed for complex AI-driven workloads while upholding strong standards around performance, sustainability, and developer simplicity. Benefit from a true automation-first culture, the ability to shape tooling and operational standards in an early-stage platform, and hands-on exposure to high-performance AI infrastructure at scale.

Step into a role driving AI and cloud innovation. Apply today!


Responsibilities:

  • Design and build automation for Linux-based GPU clusters
  • Write scripts and tooling in Bash and Python
  • Improve system reliability, monitoring and incident response
  • Support AI training environments using Kubernetes, Slurm and Docker
  • Act as a point of contact during incidents and drive resolution
  • Identify and automate manual operational processes
  • Work closely with infrastructure and hardware teams
  • Contribute to future platform evolution including serverless compute


Skills/Must have:

  • Strong experience as an SRE or in a similar role
  • Deep knowledge of Linux systems and operations
  • Strong scripting or coding skills in Bash and Python
  • Experience with Kubernetes, Docker and cluster-level tooling
  • Understanding of HPC or AI workloads and multi-node training
  • Experience in high-availability, high-pressure environments
  • Automation-first mindset with interest in using AI tools


Salary:

  • Up to €130,000 gross per year

Holly Staff, Head of AI & Data Center Benelux
