Site Reliability Engineer, Machine Learning Operations, Infrastructure

Company: Tesla, Inc

Location: Austin, TX 78745

Description:

Our team manages multiple functions across Tesla, including DevOps, MLOps, cloud infrastructure (AWS, Azure, GCP), and factory site reliability. Continued development and automation of deployment, monitoring, self-healing, and alerting processes is imperative to the success of our engineering groups. As a Site Reliability Engineer, you will be responsible for maintaining and improving our platform to ensure our cross-functional teams have the necessary tools and resources to be productive.

Responsibilities
  • Mature our Machine Learning Operations platform and advocate best practices to MLOps engineers
  • Design and implement scalable, automated workflows for the complete ML lifecycle
  • Maintain Kubernetes-based infrastructure for model training, deployment, and monitoring
  • Develop solutions for workload orchestration and time-slicing using tools like Flyte and Ray
  • Implement and optimize CI/CD pipelines tailored for machine learning applications
  • Leverage GPU capabilities, including MIG, to maximize efficiency for AI/ML workloads
  • Set up model monitoring systems to track performance, ensure robustness, and scale workloads as needed
  • Collaborate with engineers to build and maintain robust pipelines for training and inference workflows
  • Develop Infrastructure-as-Code (IaC) solutions for deploying and managing cloud/on-prem ML environments
  • Design and develop intuitive, user-friendly self-service portals using React to enable data scientists and engineers to manage ML pipelines, monitor models, and access resources seamlessly
  • Participate in a 24x7 on-call rotation


Requirements
  • Strong hands-on experience with tools and frameworks like Kubernetes, Kubeflow, MLflow, Flyte, and Ray
  • Proven experience with React for building interactive web applications, especially self-service portals that enhance the user experience for managing ML pipelines and workflows
  • Expertise in MIG, time-slicing, and scaling AI workloads efficiently
  • Proficiency in Python, Golang, and Bash for pipeline development and automation
  • Experience with model deployment and serving: TensorFlow Serving, TorchServe, FastAPI, Flask, and REST/gRPC on scalable architectures
  • Proficiency with Linux fundamentals and performance optimizations
  • Experience with configuration management software (Ansible, etc.) and systems monitoring and alerting tools (Prometheus, Grafana, Telegraf, Splunk, etc.)
  • Strong analytical and problem-solving abilities to troubleshoot and optimize AI/ML systems
  • Ability to collaborate with cross-functional teams, including data scientists, data engineers, and DevOps engineers, to deliver high-quality solutions
  • Excellent troubleshooting skills in production
  • Bachelor's degree in Computer Science, Computer Engineering, Electrical Engineering, or Physics, or proof of exceptional skills in a related field, or equivalent experience


Compensation and Benefits

Along with competitive pay, as a full-time Tesla employee, you are eligible for the following benefits from day 1 of hire:
  • Aetna PPO and HSA plans: 2 medical plan options with $0 payroll deduction
  • Family-building, fertility, adoption and surrogacy benefits
  • Dental (including orthodontic coverage) and vision plans, both of which have options with a $0 paycheck contribution
  • Company-paid Health Savings Account (HSA) contribution when enrolled in the high-deductible Aetna medical plan with HSA
  • Healthcare and Dependent Care Flexible Spending Accounts (FSA)
  • 401(k) with employer match, Employee Stock Purchase Plans, and other financial benefits
  • Company-paid Basic Life, AD&D, short-term and long-term disability insurance
  • Employee Assistance Program
  • Sick and vacation time (flex time for salaried positions) and paid holidays
  • Back-up childcare and parenting support resources
  • Voluntary benefits, including critical illness, hospital indemnity, accident insurance, theft & legal services, and pet insurance
  • Weight loss and tobacco cessation programs
  • Tesla Babies program
  • Commuter benefits
  • Employee discounts and perks program