SRE - Performance Engineering

Company: WitnessAI

Location: La Cañada Flintridge, CA 91011

Description:

Job Title: Site Reliability Engineer - Performance Engineering

Location: Bay Area preferred/Hybrid

Department: DevOps

At WitnessAI, we're at the intersection of innovation and security in AI. We are seeking a Site Reliability Engineer focused on deep systems-level performance analysis, tuning, and optimization to ensure the reliability and efficiency of our cloud-based infrastructure. You will drive performance across a tech stack that includes cloud infrastructure, Linux, Kubernetes, databases, message queuing systems, AI workloads, and GPUs. The ideal candidate brings a passion for data-driven methodologies, flame graph analysis, and advanced performance debugging to solve complex system challenges.

Key Responsibilities
  • Conduct root cause analysis (RCA) for performance bottlenecks using data-driven approaches like flame graphs, heatmaps, and latency histograms.
  • Perform detailed kernel and application tracing with tools such as eBPF, perf, and ftrace to gain insight into system behavior.
  • Design and implement performance dashboards to visualize key performance metrics in real-time.
  • Recommend Linux and cloud server tuning improvements that increase throughput and reduce latency.
  • Tune Linux systems for workload-specific demands, including scheduler, I/O subsystem, and memory management optimizations.
  • Analyze and optimize cloud instance types, EBS volumes, and network configurations for high performance and low latency.
  • Improve throughput and latency for message queues (e.g., ActiveMQ, Kafka, SQS) by profiling producer/consumer behavior and tuning configurations.
  • Apply profiling tools to analyze GPU utilization and kernel execution times and implement techniques to boost GPU efficiency.
  • Optimize distributed training pipelines using industry-standard frameworks.
  • Evaluate and reduce training times through mixed precision training, model quantization, and resource-aware scheduling in Kubernetes.
  • Work with AI teams to identify scaling challenges and optimize GPU workloads for inference and training.
  • Design observability systems for granular monitoring of end-to-end latency, throughput, and resource utilization.
  • Implement and leverage modern observability stacks to capture critical insights into application and infrastructure behavior.
  • Work with developers to refactor applications for performance and scalability, guided by profiling data.
  • Mentor teams on performance best practices, debugging workflows, and methodologies inspired by leading performance engineers.


Qualifications

Required:
  • Deep expertise in Linux systems internals (kernel, I/O, networking, memory management) and performance tuning.
  • Strong experience with AWS cloud services and their performance optimization techniques.
  • Proficiency with performance analysis and load-testing tools, as well as system tracing frameworks.
  • Hands-on experience with database tuning, query analysis, and indexing strategies.
  • Expertise in GPU workload optimization and cloud-based GPU instances.
  • Familiarity with message queuing systems, including their performance tuning.
  • Programming experience with a focus on profiling and tuning.
  • Strong scripting skills (e.g., Python, Bash) to automate performance measurement and tuning workflows.


Preferred:
  • Knowledge of distributed AI/ML training frameworks.
  • Experience designing and scaling GPU workloads on Kubernetes using GPU-aware scheduling and resource isolation.
  • Expertise in optimizing AI inference pipelines.
  • Familiarity with Brendan Gregg's systems analysis methodologies, such as the USE method (Utilization, Saturation, Errors) and workload characterization.

Benefits:
  • Hybrid work environment
  • Competitive salary
  • Health, dental, and vision insurance
  • 401(k) plan
  • Opportunities for professional development and growth
  • Generous vacation policy
