Infrastructure Product Engineer - AI Infrastructure

1666626
  • $300,000 to $350,000 gross per year
  • San Francisco, California
  • Permanent
  • Artificial Intelligence
  • AI Network
  • AI Software


Join a stealth-mode startup building a next-generation AI and cloud platform powered by thousands of H100s, H200s, and B200s, designed for rapid experimentation, full-scale model training, and production inference. As a Senior Infrastructure Product Engineer, you’ll sit at the intersection of platform architecture, product thinking, and large-scale systems engineering, shaping how AI infrastructure is exposed, consumed, and scaled.

This role goes beyond keeping systems running. You’ll architect the underlying primitives that power new infrastructure products, defining how compute, networking, scheduling, and observability come together as a coherent platform. You’ll work closely with product, ML, and hardware teams to turn raw GPU capacity into reliable, developer-friendly capabilities.

If you want to architect infrastructure as a product, define the building blocks behind frontier AI platforms, and influence how thousands of GPUs are consumed at scale, this is a rare chance to do it from first principles.  

Get in touch and apply today! 


Responsibilities:

  • Architect and evolve large-scale GPU platforms (H100/H200/B200) to support training, inference, and emerging AI workloads.
  • Design infrastructure abstractions and platform primitives that enable new AI and cloud products.
  • Build scalable automation frameworks for provisioning, scheduling, and lifecycle management across Slurm, Kubernetes, and bare-metal environments.
  • Partner with product and ML teams to translate user requirements into infrastructure architecture and platform capabilities.
  • Define reliability, scalability, and performance standards as architectural constraints rather than reactive fixes.
  • Develop observability and capacity models that inform platform design, roadmap decisions, and customer-facing SLAs.
  • Identify systemic bottlenecks across compute, network, and storage layers and drive architectural improvements.

Skills/Must have:

  • 7+ years of experience in Infrastructure Engineering, Platform Engineering, SRE, or Systems Architecture roles.
  • Proven experience designing and operating large-scale GPU or HPC platforms.
  • Deep hands-on expertise with Kubernetes and Slurm, including scheduler behaviour and workload optimisation.
  • Strong Linux systems and networking fundamentals in high-performance environments.
  • Proficiency in Python, Go, or Bash for building platform tooling and automation.
  • Experience treating infrastructure as a product, with a focus on usability, interfaces, and scalability.
  • Familiarity with observability platforms (Prometheus, Grafana, Loki) and performance analysis at scale.

Benefits:

  • Equity 

Salary:

  • $300,000 to $350,000 gross per year


Ben Davies, Director, Global AI Infrastructure
