Infrastructure Engineer (AI) - Hosting
- S$200,000 base per year
- Adam Park, Singapore
- Permanent
- Artificial Intelligence
- AI Network
- AI Software
Looking to advance your career in AI and high-performance computing while working with next-generation GPU infrastructure?
Join a technology team that provides scalable GPU computing solutions and global infrastructure for AI and compute-intensive workloads. The team focuses on simplifying access to high-performance systems, allowing engineers to deploy, manage, and optimize resources efficiently across multiple environments. Team members work on real-world projects, collaborating closely with experienced professionals in a fast-paced, innovative environment. This role offers unparalleled exposure to the latest AI technologies, the chance to work with industry-leading customers, and the ability to make a tangible impact on the future of AI.
Apply now to grow your expertise and play a key role in shaping the future of GPU infrastructure and AI computing!
Responsibilities:
- Get AI Platform customers production-ready: stand up Kubernetes clusters, configure GPU drivers, validate networking, and troubleshoot the issues that surface when real workloads hit real hardware.
- Own the bare-metal platform layer (NCCL, InfiniBand, NVLink, storage) and its integration with orchestration layers (Kubernetes, SLURM) and the MLOps tooling customers actually use.
- Configure, benchmark, and debug NVIDIA driver stacks, firmware versions, CUDA compatibility, NCCL tuning, and MIG configurations.
- Run quality benchmarks and diagnostics to validate performance for inference and training workloads across chip types.
- Identify gaps before customers do, pressure-testing the infrastructure, APIs, and workflows to find what's missing or broken.
- Turn customer learnings into product, working with Product and Engineering to build reusable templates, default configurations, and automated workflows that eliminate manual onboarding.
- Advise customers on chip selection and tokenomics, helping AI platform customers understand price/performance trade-offs across GPU types, cost-per-token economics, and which hardware fits their inference or training workloads.
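The tokenomics advisory work above boils down to comparing cost-per-token across GPU types. A minimal sketch of that arithmetic is below; the fleet names, hourly rates, and throughput figures are made-up placeholders for illustration, not vendor data.

```python
# Illustrative cost-per-token comparison across GPU types.
# All prices and tokens/s figures are hypothetical placeholders.

def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_second: float) -> float:
    """Dollars to generate one million tokens at a given sustained throughput."""
    seconds_per_million = 1_000_000 / tokens_per_second
    return hourly_rate_usd * seconds_per_million / 3600

# Hypothetical fleet: (chip name, $/GPU-hour, tokens/s for a given model)
fleet = [
    ("gpu-a", 2.50, 1500.0),
    ("gpu-b", 4.00, 3200.0),
]

for name, rate, tps in fleet:
    print(f"{name}: ${cost_per_million_tokens(rate, tps):.2f} per 1M tokens")
```

The cheaper chip per hour is not always the cheaper chip per token; the trade-off only resolves once you benchmark real throughput for the customer's actual model and batch sizes.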
Skills/Must have:
- Bare-metal Linux depth: experience administering GPU servers close to the hardware: driver stacks, kernel tuning, firmware, and storage configuration.
- NVIDIA GPU stack expertise: drivers, CUDA, NCCL, NVLink, nvidia-smi profiling.
- Good understanding of how stack compatibility affects performance.
- Kubernetes and orchestration: production experience with K8s, SLURM, or similar. You know how to stand up clusters, not just deploy to them.
- AI Networking fundamentals: TCP/IP, VLANs, bonding, and high-speed interconnects (InfiniBand, RoCE) for distributed workloads.
- Customer-facing communication: work directly with engineers at AI platform companies, understand their constraints, and translate that into clear requirements for your team.
- Bias toward scalable solutions: you'd rather build a feature that helps 10 customers than a custom deployment that helps one.
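Several of the skills above (nvidia-smi profiling, driver/CUDA stack compatibility) come together in routine fleet sanity checks. The sketch below parses `nvidia-smi` CSV output and compares driver versions against a floor; the version floor used here is an assumed example, not NVIDIA's actual compatibility matrix.

```python
# Sketch of a driver sanity check built on `nvidia-smi` CSV output.
# The driver floor below is a hypothetical example, not an official requirement.
import csv
import io
import subprocess

ASSUMED_DRIVER_FLOOR = (525, 60)  # illustrative minimum (major, minor)

def parse_gpu_report(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=name,driver_version --format=csv` output."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header = [h.strip() for h in rows[0]]
    return [dict(zip(header, (c.strip() for c in row))) for row in rows[1:] if row]

def driver_at_least(version: str, floor: tuple = ASSUMED_DRIVER_FLOOR) -> bool:
    """Compare a dotted driver version (e.g. '535.104.05') against a floor."""
    parts = tuple(int(p) for p in version.split("."))
    return parts >= floor

def live_report() -> list[dict]:
    """Query the local GPUs; only callable on a node with the NVIDIA driver installed."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_report(out)
```

In practice a check like this is the first gate before deeper validation (NCCL bandwidth tests, MIG layout verification), because a driver/CUDA mismatch will quietly degrade or break everything downstream.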
Benefits:
- Comprehensive health, dental, and vision insurance, 401(k) with employer matching
- 10-15% bonus
- Equity options
Salary:
- S$200,000 base per year