Senior Operations Manager - AI Infrastructure

Hypertec Group

125 - 150 Posted: 9 hours ago

Job Description

<h3>Senior Operations Manager - AI Infrastructure</h3>
Job Category: Sales Support
Requisition Number: SENIO002199
<ul>
<li>Posted: September 5, 2025</li>
<li>Full-Time</li>
</ul>

5C is seeking a strategic, hands-on Senior Operations Manager to lead the deployment and ongoing operations of our large-scale AI infrastructure.

Mission:
You will manage a cross-functional team of System Administrators, DevOps Engineers, and Support Specialists, ensuring reliability, performance, and scalability across high-performance compute clusters.
The ideal candidate has direct experience managing GPU/TPU-based AI clusters designed for large-scale training and inference. You will collaborate with Networking teams and System Architects to align day-to-day operations with system design, network performance, and long-term infrastructure strategy.

What You’ll Be Contributing:
<ul>
<li>Lead, mentor, and grow a high-performing technical team.</li>
<li>Set clear goals, track performance, and foster a culture of accountability and continuous improvement.</li>
</ul>

Infrastructure Operations
<ul>
<li>Oversee deployment, scaling, and lifecycle management of GPU-based AI clusters across on-premises and cloud environments.</li>
<li>Ensure infrastructure performance, resilience, and cost efficiency for compute-intensive AI workloads.</li>
<li>Partner with Networking teams to optimize high-bandwidth, low-latency connectivity.</li>
<li>Work closely with System Architects to deliver scalable, maintainable infrastructure aligned with long-term goals.</li>
</ul>

DevOps & Automation
<ul>
<li>Champion Infrastructure-as-Code (IaC) practices for automated provisioning, configuration, and monitoring.</li>
<li>Drive adoption of CI/CD pipelines for reliable infrastructure and model deployments.</li>
</ul>

System Reliability & Support
<ul>
<li>Maintain system performance, security, and availability through proactive monitoring, patching, and support.</li>
<li>Lead incident response, root cause analysis, and continuous improvement initiatives.</li>
</ul>

Strategic Planning & Budgeting
<ul>
<li>Support capacity planning and roadmap development to meet future compute demands.</li>
<li>Manage budgets for hardware procurement, cloud services, and licensing.</li>
</ul>

Compliance & Security
<ul>
<li>Partner with Security and Networking teams to implement access controls, monitoring, and compliance standards.</li>
<li>Ensure adherence to regulatory and internal security policies.</li>
</ul>

What Sets You Apart:

Required:
<ul>
<li>7+ years in infrastructure or IT operations, with 3+ years in a leadership role.</li>
<li>Proven experience managing high-performance computing environments (GPU/TPU clusters).</li>
<li>Strong expertise in Linux systems, distributed systems, and automation tooling.</li>
<li>Track record of collaboration with Networking teams for performance and security.</li>
<li>Experience aligning with System Architects to deliver scalable infrastructure.</li>
<li>Proficiency with DevOps tools (Terraform, Ansible, Kubernetes, Docker, CI/CD pipelines).</li>
<li>Familiarity with cloud platforms (AWS, GCP, Azure) and hybrid infrastructures.</li>
<li>Excellent leadership, communication, and organizational skills.</li>
</ul>

Preferred:
<ul>
<li>Direct experience managing AI clusters for deep learning training/inference.</li>
<li>Background in AI/ML, data science, or high-throughput data processing.</li>
<li>Experience with HPC and workload schedulers (Slurm, Kubernetes).</li>
<li>Relevant certifications (cloud, networking, DevOps).</li>
<li>Knowledge of observability tools (Prometheus, Grafana, ELK stack).</li>
</ul>

Why Join Us:
At the forefront of AI infrastructure innovation, you’ll play a pivotal role in scaling next-generation compute environments that power cutting-edge AI research and applications. This is an opportunity to lead high-impact operations in a rapidly growing, collaborative environment.

----------------------------

Note to Applicants: This recruitment is being managed by Hypertec on behalf of our partner organization, 5C. If selected, you will be hired directly by the partner company and will be joining their team. We are supporting them in identifying top talent to help scale their people operations during a period of exciting growth.

About 5C Group
5C Group is a next-generation AI Digital Infrastructure provider established from the acquisition of 5C Data Centers by Hypertec Cloud. With over 2 gigawatts (GW) of roadmap capacity and the ability to power hundreds of thousands of GPUs, 5C Group delivers secure, reliable, and sustainable data center and AI infrastructure solutions at scale to the largest and most demanding AI users. For more information, please visit www.5c.ai.
