Site Reliability Engineer (EU Remote) - AI infrastructure

1658763 Posted: 23/01/2026

€200,000 gross per year
Amstelveen, Netherlands
Permanent
Artificial Intelligence
AI Network
AI Software

Join a seed-stage AI infrastructure company building large-scale training and inference platforms previously accessible only to hyperscalers. The business began with a single managed GPU cluster that reached capacity almost immediately and has since expanded into a global platform spanning infrastructure, networking, and orchestration.

They are now seeking a Site Reliability Engineer to join their EU operations in this remote role. This is ideal if you have 7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles and have supported large-scale compute environments.

If you are interested in this exciting opportunity, get in touch and apply today!

Responsibilities:

Design, deploy, and maintain large-scale GPU clusters (H100/H200/B200) for training and inference workloads.
Build automation pipelines for provisioning, scaling, and monitoring compute resources across Slurm and Kubernetes environments.
Develop observability, alerting, and auto-healing systems for high-availability GPU workloads.
Collaborate with ML, networking, and platform teams to optimise resource scheduling, GPU utilisation, and data flow.
Implement infrastructure-as-code, CI/CD pipelines, and reliability standards across thousands of nodes.
Diagnose performance bottlenecks and drive continuous improvements in reliability, latency, and throughput.

Skills / Must Have:

7+ years of experience in SRE, DevOps, or Infrastructure Engineering roles supporting large-scale compute environments.
Strong hands-on experience with Kubernetes and Slurm for cluster orchestration and workload management.
Deep knowledge of Linux systems, networking, and GPU infrastructure (NVIDIA H100/H200/B200 preferred).
Proficiency in Python, Go, or Bash for automation, tooling, and performance tuning.
Experience with observability stacks (Prometheus, Grafana, Loki) and incident response frameworks.
Familiarity with high-performance computing (HPC) or AI/ML training infrastructure at scale.
Background in reliability engineering, distributed systems, or hardware acceleration environments is a strong plus.

Benefits:

IPO Equity

Salary:

€200,000+ gross per year

Ben Davies Director Global AI Infrastructure

