AI/ML Infra Engineer - Hosting

1714770 Posted: 22/06/2026

$250,000 base salary
San Francisco, California, United States
Permanent
Artificial Intelligence
AI Software

Ready to take the next step in your career?

Join a rapidly growing AI cloud infrastructure provider building high-performance compute platforms for large-scale AI training and inference workloads. With expanding GPU infrastructure across Europe and the United States, the organisation enables AI teams to access scalable compute environments without traditional infrastructure limitations.

As a Senior ML Infrastructure Engineer, the successful candidate will help build and scale Kubernetes-based machine learning platforms supporting large-scale training and inference systems. The role focuses on workload orchestration, GPU scheduling, inference optimisation, and distributed systems reliability, working alongside highly technical teams at the intersection of machine learning, cloud infrastructure, and high-performance computing.

If you would like to learn more about this opportunity, feel free to reach out and apply today!

Responsibilities:

Build and scale internal ML infrastructure platforms focused on AI training and inference workloads
Develop systems for workload orchestration, job scheduling, and reliable execution across Kubernetes environments
Improve and maintain inference infrastructure, including model packaging, deployment, and serving optimisation
Collaborate with infrastructure and platform teams to maximise GPU utilisation, hardware performance, and operational reliability
Design scalable systems and reusable platform capabilities that improve developer experience and operational efficiency
Support CI/CD, GitOps, and infrastructure automation workflows across ML platform environments
Troubleshoot GPU performance, distributed systems behaviour, networking, and storage bottlenecks
Contribute to platform architecture discussions and long-term infrastructure scalability initiatives

Skills/Must Have:

Strong ML engineering background with hands-on experience supporting both training and inference infrastructure
Experience in infrastructure engineering, platform engineering, or software engineering environments
Strong programming skills in Python (Go experience is a plus)
Deep experience with Kubernetes, including operators, CRDs, workload orchestration, and GPU scheduling
Comfortable operating in Linux environments and debugging GPU-related issues, including CUDA, drivers, networking, and filesystems
Strong systems thinking and ability to design scalable, reliable, distributed infrastructure
Experience with CI/CD pipelines, GitOps workflows, and infrastructure automation

Desirable Skills:

Familiarity with orchestration and scheduling platforms such as Kueue, Flyte, Ray, or Slurm
Experience with PyTorch or JAX environments
Hands-on experience deploying inference workloads using vLLM, SGLang, TensorRT-LLM, or Triton
Knowledge of GPU networking and performance optimisation, including InfiniBand, NVLink, and NCCL
Experience working within HPC or large-scale distributed systems environments

Benefits:

Stock options

Salary:

$250,000 base salary

Ben Davies, Director, Global AI Infrastructure
