Director of AI Infrastructure

Company: Microsoft

Location: Redmond, WA 98052

Description:

We are Microsoft Research, a leading industrial research laboratory comprising over 1,000 computer scientists, engineers and technical staff working across the United States, United Kingdom, China, India, Canada, and the Netherlands. We are seeking a Director of AI Infrastructure to join our dynamic team in Microsoft Research, where we are at the forefront of transforming scientific research through cutting-edge technology. As a key member of this team, you will play a pivotal role in conceptualizing, architecting, and implementing innovative solutions that drive our ambitious goals. Collaborating with partners across Microsoft Research, Microsoft and the industry, you will have the opportunity to shape the future of machine learning infrastructure and how it is leveraged for AI-driven research.

This is a management and engineering role responsible for delivering next-generation AI/ML training and inference systems in a hybrid environment. The position owns graphics processing unit (GPU) cluster delivery, hardware strategy, performance optimization of state-of-the-art systems, team coordination, and the performance development of the team. The Director will collaborate with PM counterparts to provide strategic planning, coordinate with internal and external teams, and manage GPU capacity across a global organization. This role requires a deep understanding of industry trends and strategic initiatives, and the ability to align services and staffing skills accordingly. This position works closely with various teams and stakeholders to ensure the successful implementation, management, and utilization of services, hardware, and capacity.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

In alignment with our Microsoft values, we are committed to cultivating an inclusive work environment for all employees to positively impact our culture every day.

Responsibilities:

  • Leverage industry expertise, understanding and experience to envision, develop and deliver secure services that align with organizational needs and accelerate research.
  • Identify, engage and align with strategic initiatives in the company as well as industry trends, ensuring that service offerings and strategy both align with and complement emerging initiatives and trends.
  • Assure team skills are aligned with service offerings, industry trends, company strategy and organizational needs.
  • Maintain relationships between relevant groups within the organization and company to identify collaboration opportunities, service improvements and changes, and hardware and network pilots that align with key research efforts.
  • Define the services offered and the gaps they fill, including demand, costs, and lifecycle planning.
  • Be the final point of escalation for service and system failures, owning communication with customers, engineering, program management, leadership, and other stakeholders.
  • Provide leadership, hiring, spend control, policy setting, and enforcement to keep GPU clusters running and highly utilized, and to keep training job efficiency above company averages.
  • Ensure team accountability to meet customer needs through planning, documentation, training, support feedback and incident management.
  • Ensure provisions for monitoring, alerting, and resolving issues are in place and monitor daily SLA reports to drive key insights, improvements, and potential service changes.
  • Identify impactful projects running on capable hardware and offer engineering to optimize and scale out training to increase efficiency and utilization of GPU workloads.
  • Manage and develop the team: conduct 1:1s and performance reviews; plan agendas for team, engineering, and quarterly business review meetings; partner with program management; own ADO strategy and backlog reviews; and drive career growth, development, and recruiting efforts.
  • Embody our culture and values.


Qualifications:

Required Qualifications
  • Bachelor's Degree in Computer Science, Information Technology, or related field AND 4+ years technical experience in software engineering, network engineering, service engineering, or systems engineering
    • OR equivalent experience.
Preferred Qualifications
  • Master's in computer science or a related technical field
  • 3+ years of people management experience on cross-functional and/or cross-team projects.
  • Experience with large language model architectures, inference and fine-tuning
  • Proven experience delivering improved automation, monitoring, and sustainability of services
  • Experience with ML infrastructure including job schedulers, big data storage, low latency interconnects, and large clusters of GPUs
  • Experience with containerization and cloud computing technologies
  • Experience working in an academic or industrial research environment and communicating effectively within such an environment
  • Experience with engineering practices, continuous integration and continuous delivery/continuous deployment (CI/CD) pipelines and Git
  • Experience in cloud/infrastructure technologies, information technology (IT) consulting/support, systems administration, network operations, software development/support, technology solutions, practice development, architecture, and/or consulting or related field
Service Engineering M5 - The typical base pay range for this role across the U.S. is USD $137,600 - $267,000 per year. A different range applies to specific work locations within the San Francisco Bay area and New York City metropolitan area, where the base pay range for this role is USD $180,400 - $294,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

Microsoft will accept applications for the role until February 3, 2025.

Microsoft is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants regardless of criminal histories, consistent with legal requirements. If you need assistance and/or a reasonable accommodation due to a disability during the application or the recruiting process, please send a request via the Accommodation request form.

Benefits/perks listed below may vary depending on the nature of your employment with Microsoft and the country where you work.

#Research

#MSRR