SRE - Performance Engineering
Company: Witness AI
Location: La Cañada Flintridge, CA 91011
Description:
Job Title: Site Reliability Engineering - Performance Engineer
Location: Bay Area preferred/Hybrid
Department: DevOps
At WitnessAI, we're at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer focused on performance engineering. This role emphasizes deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes cloud infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.
Key Responsibilities
- Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches such as flame graphs (a brief illustrative sketch follows this list), heatmaps, and latency histograms.
- Perform detailed kernel and application tracing using tools built on eBPF, perf, and ftrace to gain insight into system behavior.
- Design and implement performance dashboards to visualize key performance metrics in real-time.
- Recommend Linux and cloud server tuning improvements to increase throughput and reduce latency.
- Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.
- Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.
- Improve throughput and latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.
- Apply profiling tools to analyze GPU utilization and kernel execution times and implement techniques to boost GPU efficiency.
- Optimize distributed training pipelines using industry-standard frameworks.
- Evaluate and reduce training times through mixed precision training, model quantization, and resource-aware scheduling in Kubernetes.
- Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.
- Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.
- Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.
- Work with developers, using profiling tools, to refactor applications for performance and scalability.
- Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.
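To give a concrete flavor of the flame-graph work above, here is a minimal sketch in Python. It assumes a Linux host with `perf` installed and Brendan Gregg's FlameGraph scripts (`stackcollapse-perf.pl`, `flamegraph.pl`) on the PATH; the script, its function name, and its parameters are illustrative only, not tooling used at WitnessAI.

```python
#!/usr/bin/env python3
"""Minimal sketch: capture a CPU flame graph with perf.

Assumes a Linux host with `perf` installed and Brendan Gregg's
FlameGraph scripts (stackcollapse-perf.pl, flamegraph.pl) on PATH.
All names and parameters here are illustrative.
"""
import subprocess


def capture_flamegraph(duration_s: int = 30, out_svg: str = "flame.svg") -> None:
    # Sample all CPUs at 99 Hz with call graphs for duration_s seconds.
    subprocess.run(
        ["perf", "record", "-F", "99", "-a", "-g", "--", "sleep", str(duration_s)],
        check=True,
    )
    # Dump the profile as text, fold the stacks, and render the SVG.
    script = subprocess.run(["perf", "script"], check=True, capture_output=True)
    folded = subprocess.run(
        ["stackcollapse-perf.pl"], input=script.stdout,
        check=True, capture_output=True,
    )
    with open(out_svg, "wb") as f:
        subprocess.run(["flamegraph.pl"], input=folded.stdout, check=True, stdout=f)


if __name__ == "__main__":
    capture_flamegraph()
```

Opening the resulting SVG in a browser lets you drill into hot code paths interactively.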
Qualifications Required:
- Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.
- Strong experience with AWS cloud services and their performance optimization techniques.
- Proficiency with performance analysis tools, load-testing tools, and system tracing frameworks.
- Hands-on experience with database tuning, query analysis, and indexing strategies.
- Expertise in GPU workload optimization and cloud-based GPU instances.
- Familiarity with message queuing systems, including performance tuning.
- Programming experience with a focus on profiling and tuning.
- Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows (a brief illustrative sketch follows this list).
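As a hypothetical illustration of the scripting skills called out above, here is a minimal sketch of automating a latency measurement. The endpoint, sample count, and function names are invented for the example; only the Python standard library is used.

```python
#!/usr/bin/env python3
"""Minimal sketch: measure request latency and report percentiles.

TARGET_URL and SAMPLES are invented for this example; only the
Python standard library is used.
"""
import statistics
import time
import urllib.request

TARGET_URL = "http://localhost:8080/healthz"  # hypothetical endpoint
SAMPLES = 100


def measure_latencies(url: str, n: int) -> list[float]:
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        with urllib.request.urlopen(url) as resp:
            resp.read()
        latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds
    return latencies


def report(latencies: list[float]) -> None:
    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    print(f"p50={p50:.2f}ms p95={p95:.2f}ms p99={p99:.2f}ms "
          f"max={max(latencies):.2f}ms")


if __name__ == "__main__":
    report(measure_latencies(TARGET_URL, SAMPLES))
```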
Preferred:
- Knowledge of distributed AI/ML training frameworks.
- Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.
- Expertise in optimizing AI inference pipelines.
- Familiarity with Brendan Gregg's methodologies for systems analysis, such as USE (Utilization, Saturation, Errors) and Workload Characterization frameworks (a brief illustrative sketch follows this list).
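For context on the USE method mentioned above, here is a minimal sketch of a CPU spot check on Linux. All names and the saturation interpretation are assumptions for illustration, not WitnessAI tooling.

```python
#!/usr/bin/env python3
"""Minimal sketch: a USE-method spot check for CPU on Linux.

Reads /proc/stat for utilization and the 1-minute load average as a
saturation proxy. Names and thresholds are assumptions for illustration.
"""
import os
import time


def cpu_utilization(interval_s: float = 1.0) -> float:
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return fields[3] + fields[4], sum(fields)  # (idle + iowait, total)

    idle0, total0 = snapshot()
    time.sleep(interval_s)
    idle1, total1 = snapshot()
    return 100.0 * (1.0 - (idle1 - idle0) / (total1 - total0))


def cpu_saturation() -> float:
    # Load average above 1.0 per core suggests runnable tasks are queueing.
    return os.getloadavg()[0] / os.cpu_count()


if __name__ == "__main__":
    print(f"CPU utilization: {cpu_utilization():.1f}%")
    print(f"CPU saturation (1-min load per core): {cpu_saturation():.2f}")
```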
Benefits:
- Hybrid work environment
- Competitive salary
- Health, dental, and vision insurance
- 401(k) plan
- Opportunities for professional development and growth
- Generous vacation policy