Staff Software Engineer / Tech Lead (Model Training Infrastructure)


Company: Anyscale, Inc

Location: San Francisco, CA 94112

Description:

About Anyscale:

At Anyscale, we're on a mission to democratize distributed computing and make it accessible to software developers of all skill levels. We're commercializing Ray, a popular open-source project that's creating an ecosystem of libraries for scalable machine learning. Companies such as OpenAI, Uber, Spotify, Instacart, and Cruise, among many others, have Ray in their tech stacks to accelerate the progress of AI applications into the real world.

With Anyscale, we're building the best place to run Ray, so that any developer or data scientist can scale an ML application from their laptop to the cluster without needing to be a distributed systems expert.

We're proud to be backed by Andreessen Horowitz, NEA, and Addition, with more than $250 million raised to date.

About The Role:

Anyscale is looking for a staff software engineer to lead the Model Training Infrastructure team.

The Model Training Infrastructure team leads the development and optimization of Ray's distributed training libraries, focusing on enabling large-scale ML workloads. The team owns and maintains widely adopted open source libraries like Ray Train for distributed model training and Ray Tune for distributed hyperparameter tuning.

As the technical leader for this team, you will be responsible for:
  • Thinking deeply about delightful, programmatic interfaces that let machine learning engineers scale model training
  • Building and rethinking distributed training architectures that scale seamlessly from a laptop to the cloud
  • Implementing and innovating on distributed training algorithms, such as elastic training, to improve model training performance
  • Working with and leading a robust open source community around the Ray project
  • Engaging directly with ML infrastructure teams around the world to iterate on and build the best training infrastructure
  • Advocating for and sharing your work broadly with the ML community through talks, tutorials, and blog posts

On a day-to-day basis, you will drive the technical direction of the team, mentor engineers, and deliver high-impact projects. You'll shape the vision for what training infrastructure looks like for enterprises around the world while remaining hands-on with the code and product development.

We'd love to hear from you if you have:
  • Multiple years of experience building, scaling, and maintaining complex software systems in production
  • Proven experience leading or mentoring engineering teams in a technical capacity
  • Expertise in machine learning frameworks (e.g., PyTorch, TensorFlow, XGBoost)
  • Hands-on experience with distributed systems and designing fault-tolerant infrastructure
  • Excellent communication and collaboration skills

Bonus points if you have:
  • Experience with Ray
  • Experience with cloud technologies (e.g., AWS, GCP, Kubernetes)
  • Experience building and operating ML training platforms in production
  • Contributions to or maintenance of open-source libraries
  • Experience leading open-source or cross-functional teams

Compensation:
  • At Anyscale, we take a market-based approach to compensation. We are data-driven, transparent, and consistent. The target salary for this role is $237,000 to $284,614. As market data changes over time, the target salary for this role may be adjusted.
  • This role is also eligible to participate in Anyscale's Equity and Benefits offerings, including the following:
    • Stock Options
    • Healthcare plans, with 99% of premiums covered by Anyscale
    • 401k Retirement Plan
    • Education & Wellbeing Stipend
    • Paid Parental Leave
    • Fertility Benefits
    • Flexible Time Off
    • Commute reimbursement
    • 100% of in-office meals covered

Anyscale Inc. is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law.

Anyscale Inc. is an E-Verify company, and you may review the Notice of E-Verify Participation and the Right to Work posters in English and Spanish.
