GPU Cluster Architect - Technology and Cloud Infrastructure Provider

1651455
  • Up to €200,000 gross per year
  • Amsterdam [Netherlands]
  • Permanent
  • Artificial Intelligence
  • AI Network


Our client is a global technology and cloud infrastructure provider specialising in high-performance platforms for AI and machine-learning workloads. They deliver large-scale compute, GPU-accelerated environments and managed services that enable organisations to build, train and deploy advanced applications at scale. With a strong international footprint across Europe and North America, the business combines cutting-edge infrastructure with developer-focused tools to provide secure, scalable and cost-effective access to AI-ready cloud solutions.

We’re looking for a GPU Cluster Architect to join them and lead the design and development of next-generation AI infrastructure powering large-scale, GPU-accelerated workloads. In this hands-on role, you’ll own architectural decisions across compute, networking, and storage, building platforms capable of supporting the scale, performance, and reliability demands of modern AI and ML systems.

You’ll define how tens of thousands of GPUs are interconnected, powered, cooled, and optimized across multiple data center sites. Working alongside world-class engineering teams, you’ll shape the backbone of one of the most advanced AI clouds in the world.

If you’re passionate about designing ultra-scale systems, optimizing performance for LLM training and inference, and building the core infrastructure that powers AI innovation, this is your opportunity.

Responsibilities:

  • Architect scalable GPU cluster topologies spanning compute nodes, interconnects (InfiniBand, Ethernet), storage, and control planes
  • Model and analyze AI/ML workloads (LLM training, inference) to drive tradeoffs in latency, bandwidth, GPU density, and performance
  • Collaborate with network architects to design and validate low-latency, high-throughput interconnects (InfiniBand HDR/NDR, RoCEv2) at POD and data center scale
  • Integrate and optimize storage solutions to support training datasets, checkpointing, and high-performance I/O operations
  • Design for reliability, incorporating telemetry, automation, and monitoring to detect and resolve issues early
  • Partner with cross-functional teams including SRE, networking, storage, and data center engineering to operationalize your designs

Skills / Must Have:

  • 5+ years of experience designing GPU or HPC clusters at scale
  • Deep understanding of modern GPU architectures (NVIDIA, AMD)
  • Expertise with HPC interconnects (InfiniBand, RoCE) and low-latency networking
  • Strong background in systems architecture, compute, and hardware reliability
  • Proficiency in scripting and automation (Python, Go)

Bonus If You Have:

  • Experience with AI/ML workload optimization and performance modeling
  • Familiarity with large-scale data center design and cooling/power strategies
  • Exposure to orchestration systems (Kubernetes, Slurm) or telemetry frameworks

Benefits:

  • Bonus scheme 
  • Company shares
  • Flexible remote working

Salary:

  • Up to €200,000 gross per year 



Holly Staff, Principal Network Consultant, BLX
