AI Infrastructure Engineer (Bare Metal) - Hosting

1674717
  • $200,000 base per year
  • Albany, New York, United States
  • Permanent
  • Artificial Intelligence
  • AI Network
  • AI Software


Are you ready to advance your career in AI and high-performance computing while working with next-generation GPU infrastructure?

Join a technology team that provides scalable GPU computing solutions and global infrastructure for AI and compute-intensive workloads. The team focuses on simplifying access to high-performance systems, allowing engineers to deploy, manage, and optimize resources efficiently across multiple environments. Team members work on real-world projects, collaborating closely with experienced professionals in a fast-paced, innovative environment. This role offers hands-on experience with modern compute platforms, exposure to cloud and bare metal systems, and opportunities to contribute to solutions that support advanced AI workloads worldwide.

Apply now to grow your expertise and play a key role in shaping the future of GPU infrastructure and AI computing!


Responsibilities:

  • Get AI Platform customers production-ready on the platform — standing up Kubernetes clusters, configuring GPU drivers, validating networking, and troubleshooting the issues that surface when real workloads hit real hardware.
  • Own the bare metal platform layer (NCCL, InfiniBand, NVLink, storage) and integrate it with the orchestration layers (Kubernetes, SLURM) and MLOps tooling that customers actually use.
  • Configure, benchmark, and debug NVIDIA driver stacks — firmware versions, CUDA compatibility, NCCL tuning, MIG configurations. 
  • Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
  • Identify gaps before customers do, pressure-testing the infrastructure, APIs, and workflows to find what's missing or broken.
  • Turn customer learnings into product, working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
  • Advise customers on chip selection and tokenomics, helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.
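To give a flavor of the validation and diagnostic work described above, here is a minimal sketch of a tooling helper that parses GPU inventory from `nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader` (real nvidia-smi flags; the sample output below is illustrative, not from a real machine):

```python
import csv
import io

def parse_gpu_inventory(csv_text: str) -> list[dict]:
    """Parse the CSV output of an nvidia-smi --query-gpu invocation
    into a list of per-GPU records for validation checks."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text)):
        idx, name, driver, mem = (field.strip() for field in row)
        gpus.append({"index": int(idx), "name": name,
                     "driver": driver, "memory": mem})
    return gpus

# Illustrative sample output (what nvidia-smi might print on an H100 node).
sample = """\
0, NVIDIA H100 80GB HBM3, 550.54.15, 81559 MiB
1, NVIDIA H100 80GB HBM3, 550.54.15, 81559 MiB
"""

inventory = parse_gpu_inventory(sample)
```

A harness like this can, for example, flag nodes whose GPUs report mismatched driver versions before a customer workload ever lands on them.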


Skills/Must have:

  • Bare metal Linux depth: you've administered GPU servers at the metal, including driver stacks, kernel tuning, firmware, and storage configuration.
  • NVIDIA GPU stack expertise: drivers, CUDA, NCCL, NVLink, nvidia-smi profiling. You understand how stack compatibility affects performance.
  • Kubernetes and orchestration: production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
  • AI networking fundamentals: TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
  • Customer-facing communication: you can work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
  • Bias toward scalable solutions: you'd rather build a feature that helps 10 customers than a custom deployment that helps one.
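The "stack compatibility" point above boils down to checks like the one sketched here: verifying that a node's installed driver meets the minimum required by the CUDA toolkit a customer intends to run. This is a sketch; the version numbers in the table are examples and should always be confirmed against NVIDIA's release notes:

```python
def driver_supports(driver_version: str, min_version: str) -> bool:
    """Compare dotted driver versions numerically, not lexically."""
    as_tuple = lambda v: tuple(int(part) for part in v.split("."))
    return as_tuple(driver_version) >= as_tuple(min_version)

# Illustrative minimum Linux driver versions per CUDA toolkit release
# (example values only; confirm against NVIDIA's compatibility docs).
MIN_DRIVER = {"12.0": "525.60.13", "11.8": "450.80.02"}

ok = driver_supports("550.54.15", MIN_DRIVER["12.0"])
too_old = driver_supports("470.82.01", MIN_DRIVER["12.0"])
```

The numeric tuple comparison matters because a plain string comparison would rank "470.82.01" above "525.60.13" in some naive schemes; version fields must be compared as integers.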


Benefits:

  • Comprehensive health, dental, and vision insurance; 401(k) with employer matching
  • 10-15% bonus
  • Equity options


Salary:

  • $200,000 base per year

Ben Davies, Director, Global AI Infrastructure

Apply for this role