GPU Cluster Architect - Data Center
- Up to €200,000 gross per year
- Amsterdam, Netherlands
- Permanent
- Telecoms
- IP Networking & Transmission
We are partnered with a fast-growing global technology organisation specialising in full-stack cloud infrastructure designed for the artificial intelligence era. Headquartered in Amsterdam and listed on Nasdaq, it builds and operates cutting-edge AI cloud platforms and large-scale GPU-powered data centres that enable developers, researchers and enterprises to train, deploy and scale AI workloads with unmatched performance and reliability. With a presence across Europe, North America and Israel, the business combines deep technical expertise with a mission to democratise access to advanced AI infrastructure, supporting innovation across sectors from life sciences to media and beyond.
We’re looking for a GPU Cluster Architect to lead the design and development of the organisation’s next-generation AI infrastructure for large-scale, GPU-accelerated workloads. In this hands-on role, you’ll own architectural decisions across compute, networking, and storage, building platforms capable of supporting the scale, performance, and reliability demands of modern AI and ML systems.
You’ll define how tens of thousands of GPUs are interconnected, powered, cooled, and optimized across multiple data center sites. Working alongside world-class engineering teams, you’ll shape the backbone of one of the most advanced AI clouds in the world.
If you’re passionate about designing ultra-scale systems, optimizing performance for LLM training and inference, and building the core infrastructure that powers AI innovation, this is your opportunity.
Responsibilities:
- Architect scalable GPU cluster topologies spanning compute nodes, interconnects (InfiniBand, Ethernet), storage, and control planes
- Model and analyze AI/ML workloads (LLM training, inference) to drive tradeoffs in latency, bandwidth, GPU density, and performance
- Collaborate with network architects to design and validate low-latency, high-throughput interconnects (InfiniBand HDR/NDR, RoCEv2) at POD and data center scale
- Integrate and optimize storage solutions to support training datasets, checkpointing, and high-performance I/O operations
- Design for reliability, incorporating telemetry, automation, and monitoring to detect and resolve issues early
- Partner with cross-functional teams including SRE, networking, storage, and data center engineering to operationalize your designs
Skills / Must Have:
- 5+ years of experience designing GPU or HPC clusters at scale
- Deep understanding of modern GPU architectures (NVIDIA, AMD)
- Expertise with HPC interconnects (InfiniBand, RoCE) and low-latency networking
- Strong background in systems architecture, compute, and hardware reliability
- Proficiency in scripting and automation (Python, Go)
Bonus If You Have:
- Experience with AI/ML workload optimization and performance modeling
- Familiarity with large-scale data center design and cooling/power strategies
- Exposure to orchestration systems (Kubernetes, Slurm) or telemetry frameworks
Benefits:
- Bonus scheme
- Company shares
- Flexible remote working
Salary:
- Up to €200,000 gross per year