AI/ML Infra Engineer - Hosting

1714770
  • $250,000 base salary
  • San Francisco, California, United States
  • Permanent
  • Artificial Intelligence
  • AI Software


Ready to take the next step in your career?

Join a rapidly growing AI cloud infrastructure provider building high-performance compute platforms for large-scale AI training and inference workloads. With expanding GPU infrastructure across Europe and the United States, the organisation enables AI teams to access scalable compute environments without traditional infrastructure limitations.

As a Senior ML Infrastructure Engineer, the successful candidate will help build and scale Kubernetes-based machine learning platforms supporting large-scale training and inference systems. The role focuses on workload orchestration, GPU scheduling, inference optimisation, and distributed systems reliability, working alongside highly technical teams at the intersection of machine learning, cloud infrastructure, and high-performance computing.

To learn more about this opportunity, reach out or apply today.


Responsibilities:

  • Build and scale internal ML infrastructure platforms focused on AI training and inference workloads
  • Develop systems for workload orchestration, job scheduling, and reliable execution across Kubernetes environments
  • Improve and maintain inference infrastructure, including model packaging, deployment, and serving optimisation
  • Collaborate with infrastructure and platform teams to maximise GPU utilisation, hardware performance, and operational reliability
  • Design scalable systems and reusable platform capabilities that improve developer experience and operational efficiency
  • Support CI/CD, GitOps, and infrastructure automation workflows across ML platform environments
  • Troubleshoot GPU performance, distributed systems behaviour, networking, and storage bottlenecks
  • Contribute to platform architecture discussions and long-term infrastructure scalability initiatives


Must-Have Skills:

  • Strong ML engineering background with hands-on experience supporting both training and inference infrastructure
  • Experience with infrastructure engineering, platform engineering, or software engineering environments
  • Strong programming skills in Python (Go experience is a plus)
  • Deep experience with Kubernetes, including operators, CRDs, workload orchestration, and GPU scheduling
  • Comfortable operating in Linux environments and debugging GPU-related issues, including CUDA, drivers, networking, and filesystems
  • Strong systems thinking and ability to design scalable, reliable, distributed infrastructure
  • Experience with CI/CD pipelines, GitOps workflows, and infrastructure automation


Desirable Skills:

  • Familiarity with orchestration and scheduling platforms such as Kueue, Flyte, Ray, or Slurm
  • Experience with PyTorch or JAX environments
  • Hands-on experience deploying inference workloads using vLLM, SGLang, TensorRT-LLM, or Triton
  • Knowledge of GPU networking and performance optimisation, including InfiniBand, NVLink, and NCCL
  • Experience working within HPC or large-scale distributed systems environments


Benefits:

  • Stock options


Salary:

  • $250,000 base salary

Ben Davies, Director, Global AI Infrastructure
