HPC Cluster Engineer - AI Infrastructure

1706706
  • Competitive and based on experience.
  • Amsterdam, Netherlands
  • Permanent
  • Artificial Intelligence
  • AI Software


Ready to take the next step in your career?

Join a provider of AI cloud infrastructure delivering full-stack platforms for developers, enterprises, and research institutions to build and deploy generative AI applications. The organisation enables teams to train and run machine learning models in a secure, high-performance, and cost-efficient cloud environment, supporting faster innovation and scientific progress.

The company is seeking a Cloud Infrastructure Engineer to support a hyperscaler platform for GPU-accelerated and AI workloads. The role focuses on improving virtualization and system performance across large-scale infrastructure. It involves collaboration with specialists in high-performance computing and exposure to technologies such as RDMA, RoCE, InfiniBand, and QEMU/KVM within a fast-paced, innovation-driven environment.

Don’t miss out on this exciting opportunity: apply today!


Responsibilities:

  • Improve infrastructure supporting GPU-accelerated computing.
  • Analyze the root causes of performance and reliability issues at different scales and propose effective solutions.
  • Add support for new hardware across the infrastructure software stack.
  • Proactively detect and resolve issues to ensure platform stability and efficiency.


Skills/Must Have:

  • 5+ years of professional software development experience.
  • 3+ years working with Linux systems.
  • Strong system-level understanding of server architecture, PCIe devices, NICs, and kernel drivers.
  • Proficiency in performance-oriented programming languages (e.g., C, C++, Go, Java, Python).


Desirable Skills:

  • Experience tuning performance for HPC workloads.
  • Familiarity with RDMA, RoCE, and InfiniBand networking.
  • Knowledge of Software Defined Networking and HPC cluster networking.
  • Understanding of the QEMU/KVM virtualization stack.
  • Experience with deep learning frameworks (e.g., PyTorch, TensorFlow).
  • Familiarity with collective communication libraries (e.g., MPI, NCCL).
  • Willingness to complete a coding interview as part of the hiring process.


Benefits:

  • Competitive salary and full benefits package.
  • Opportunities for professional growth and internal mobility.
  • Hybrid work environment with flexibility.
  • Collaborative and forward-thinking engineering culture.


Salary:

  • Competitive and based on experience.

Holly Staff, Head of AI & Data Center Benelux
