Site Reliability Engineer - Systems Integrator

1658810 Posted: 25/02/2026

$120000-$150000
City of Sydney, New South Wales, Australia
Permanent
Enterprise

Ready to be the guardian of critical business data and systems?

Join a fast-growing Australian IT services firm transforming how companies protect, recover, and leverage their digital assets. From cloud and disaster recovery to cybersecurity, the team delivers smart, tailored solutions that keep clients secure, operational, and future-ready. No cookie-cutter approaches, just real impact.

If you thrive on solving high-stakes challenges and want your work to matter every day, apply now!

Key Responsibilities:

Design, build, and operate AI/ML production pipelines on GPU-enabled infrastructure
Deploy, scale, and manage inference and training workloads on Kubernetes
Productionize machine learning models using CI/CD and MLOps best practices
Manage GPU scheduling, resource allocation, and cost efficiency
Implement monitoring for model performance, data drift, latency, GPU utilization, and system health
Automate model lifecycle management and retraining workflows
Partner with data science teams to transition models from development to production
Ensure scalability, reliability, security, and compliance of AI systems
Troubleshoot and resolve production AI/ML and infrastructure incidents
Contribute to platform architecture and internal tooling decisions

Required Qualifications:

3+ years of experience in AI Ops, MLOps, DevOps, or ML infrastructure roles
Strong Python skills and experience with ML frameworks (e.g. PyTorch, TensorFlow, scikit-learn)
Hands-on experience running ML workloads on GPUs (training and/or inference)
Strong Kubernetes experience, including deploying and operating production workloads
Experience with Docker and container-based ML systems
Experience with cloud platforms (AWS, GCP, or Azure), including GPU instances
Familiarity with CI/CD pipelines and infrastructure-as-code tools
Understanding of model monitoring, observability, and data drift
Strong production mindset and problem-solving skills

Preferred Skills:

Experience with Kubernetes GPU operators, device plugins, or scheduling strategies
Experience with ML platforms such as MLflow, Kubeflow, SageMaker, or Vertex AI
Knowledge of distributed training or large-scale inference systems
Experience with feature stores and modern data pipelines
Background in platform engineering or site reliability engineering (SRE)
Familiarity with security, governance, or compliance requirements for AI systems

Benefits:

Competitive salary and equity options
Hybrid work model based in Sydney
High-impact role with ownership over GPU-backed AI platforms
Opportunity to build and shape foundational AI infrastructure
Collaborative, engineering-led culture with room for growth

Salary:

$120000-$150000

Mitchell Cole, Head of Cyber Security & Cloud APAC

