Head of ML Infrastructure

Company: Tranzeal Incorporated

Location: Palo Alto, CA 94303

Description:

Key Responsibilities:

Orchestration Platform Development:
Architect and implement an advanced orchestration platform to manage a diverse set of LLMs efficiently.
Design solutions to optimize performance, scalability, and availability across various deployment environments.

Infrastructure Management:
Use Kubernetes, Terraform, and other Infrastructure as Code (IaC) tools to automate and manage ML infrastructure.
Collaborate with DevOps and cloud engineering teams to ensure seamless integration with CI/CD pipelines.
Establish robust monitoring, logging, and alerting systems for ML infrastructure.

Multi-Cloud Strategy:
Design and execute strategies to leverage multiple cloud providers for cost optimization, redundancy, and compliance.
Manage cloud-native services to support model deployment and orchestration at scale.

Performance Optimization:
Work closely with ML engineers to fine-tune model deployment strategies, focusing on latency, throughput, and fault tolerance.
Conduct capacity planning and develop tools for model lifecycle management.

Leadership & Collaboration:
Lead a team of infrastructure engineers, fostering a culture of innovation, collaboration, and excellence.
Act as a bridge between ML research, engineering, and operations teams to align infrastructure capabilities with business needs.
Stay abreast of emerging technologies and methodologies in ML infrastructure and orchestration.

Qualifications:

Technical Skills:
Proven experience in building and managing ML infrastructure platforms, particularly for LLMs or other advanced AI systems.
Expertise in Kubernetes, Terraform, and other IaC tools.
Deep understanding of multi-cloud architectures (e.g., AWS, Azure, Google Cloud) and hybrid cloud solutions.
Strong programming skills in Python, Go, or a similar language, with experience in building automation and orchestration tools.
Familiarity with modern ML frameworks and tools (e.g., TensorFlow, PyTorch, Hugging Face).

Leadership & Communication:

  • Demonstrated success in leading infrastructure teams and managing large-scale projects.
  • Excellent problem-solving and decision-making skills.
  • Strong communication skills, with the ability to convey complex technical ideas to non-technical stakeholders.

Education & Experience:

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent work experience).
  • 8+ years of experience in infrastructure engineering, with at least 3 years in a leadership role.
