Head of ML Infrastructure
Company: Tranzeal Incorporated
Location: Palo Alto, CA 94303
Description:
Key Responsibilities:
Orchestration Platform Development:
Architect and implement an advanced orchestration platform to manage a diverse set of LLMs efficiently.
Design solutions to optimize performance, scalability, and availability across various deployment environments.
Infrastructure Management:
Use Kubernetes, Terraform, and other Infrastructure as Code (IaC) tools to automate and manage ML infrastructure.
Collaborate with DevOps and cloud engineering teams to ensure seamless integration with CI/CD pipelines.
Establish robust monitoring, logging, and alerting systems for ML infrastructure.
Multi-Cloud Strategy:
Design and execute strategies to leverage multiple cloud providers for cost optimization, redundancy, and compliance.
Manage cloud-native services to support model deployment and orchestration at scale.
Performance Optimization:
Work closely with ML engineers to fine-tune model deployment strategies, focusing on latency, throughput, and fault tolerance.
Conduct capacity planning and develop tools for model lifecycle management.
Leadership & Collaboration:
Lead a team of infrastructure engineers, fostering a culture of innovation, collaboration, and excellence.
Act as a bridge between ML research, engineering, and operations teams to align infrastructure capabilities with business needs.
Stay abreast of emerging technologies and methodologies in ML infrastructure and orchestration.
Qualifications:
Technical Skills:
Proven experience in building and managing ML infrastructure platforms, particularly for LLMs or other advanced AI systems.
Expertise in Kubernetes, Terraform, and other IaC tools.
Deep understanding of multi-cloud architectures (e.g., AWS, Azure, Google Cloud) and hybrid cloud solutions.
Strong programming skills in Python, Go, or a similar language, with experience in building automation and orchestration tools.
Familiarity with modern ML frameworks and tools (e.g., TensorFlow, PyTorch, Hugging Face).
Leadership & Communication:
Demonstrated success in leading infrastructure teams and managing large-scale projects.
Excellent problem-solving and decision-making skills.
Strong communication skills, with the ability to convey complex technical ideas to non-technical stakeholders.
Education & Experience:
Bachelor's or Master's degree in Computer Science, Engineering, or a related field (or equivalent work experience).
8+ years of experience in infrastructure engineering, with at least 3 years in a leadership