Site Reliability Engineer - Systems Integrator

1658810
  • $120,000-$150,000
  • City of Sydney, New South Wales, Australia
  • Permanent
  • Enterprise


Ready to be the guardian of critical business data and systems?

Join a fast-growing Australian IT services firm transforming how companies protect, recover, and leverage their digital assets. From cloud and disaster recovery to cybersecurity, the team delivers smart, tailored solutions that keep clients secure, operational, and future-ready. No cookie-cutter approaches, just real impact.

If you thrive on solving high-stakes challenges and want your work to matter every day, apply now!


Key Responsibilities:

  • Design, build, and operate AI/ML production pipelines on GPU-enabled infrastructure
  • Deploy, scale, and manage inference and training workloads on Kubernetes
  • Productionize machine learning models using CI/CD and MLOps best practices
  • Manage GPU scheduling, resource allocation, and cost efficiency
  • Implement monitoring for model performance, data drift, latency, GPU utilization, and system health
  • Automate model lifecycle management and retraining workflows
  • Partner with data science teams to transition models from development to production
  • Ensure scalability, reliability, security, and compliance of AI systems
  • Troubleshoot and resolve production AI/ML and infrastructure incidents
  • Contribute to platform architecture and internal tooling decisions


Required Qualifications:

  • 3+ years of experience in AI Ops, MLOps, DevOps, or ML infrastructure roles
  • Strong Python skills and experience with ML frameworks (e.g. PyTorch, TensorFlow, scikit-learn)
  • Hands-on experience running ML workloads on GPUs (training and/or inference)
  • Strong Kubernetes experience, including deploying and operating production workloads
  • Experience with Docker and container-based ML systems
  • Experience with cloud platforms (AWS, GCP, or Azure), including GPU instances
  • Familiarity with CI/CD pipelines and infrastructure-as-code tools
  • Understanding of model monitoring, observability, and data drift
  • Strong production mindset and problem-solving skills


Preferred Skills:

  • Experience with Kubernetes GPU operators, device plugins, or scheduling strategies
  • Experience with ML platforms such as MLflow, Kubeflow, SageMaker, or Vertex AI
  • Knowledge of distributed training or large-scale inference systems
  • Experience with feature stores and modern data pipelines
  • Background in platform engineering or site reliability engineering (SRE)
  • Familiarity with security, governance, or compliance requirements for AI systems


Benefits:

  • Competitive salary and equity options
  • Hybrid work model based in Sydney
  • High-impact role with ownership over GPU-backed AI platforms
  • Opportunity to build and shape foundational AI infrastructure
  • Collaborative, engineering-led culture with room for growth


Salary:

  • $120,000-$150,000

Mitchell Cole, Head of Cyber Security & Cloud APAC