SRE

Company: WitnessAI

Location: La Cañada Flintridge, CA 91011

Description:

Job Title: Site Reliability Engineer (SRE)

About Us: WitnessAI is a leader in providing innovative networking solutions designed to enhance security, performance, and reliability for businesses of all sizes. We are seeking a highly skilled Site Reliability Engineer (SRE) with a strong background in Linux administration, AWS, and Kubernetes. The ideal candidate will help ensure the reliability, scalability, and performance of our systems while driving a culture of automation and continuous improvement.

Key Responsibilities

System Reliability & Operations
  • Maintain and improve the reliability, availability, and performance of our services and infrastructure.
  • Monitor system health, troubleshoot issues, and respond to incidents with a focus on reducing mean time to recovery (MTTR).

Infrastructure Management
  • Administer and optimize Linux-based systems across development, staging, and production environments.
  • Design and manage scalable, secure, and cost-effective solutions on AWS.
  • Build, maintain, and monitor Kubernetes clusters to support containerized applications.

Automation & Tooling
  • Develop and maintain CI/CD pipelines to streamline deployments.
  • Automate operational tasks using tools such as Terraform, Crossplane, or custom scripts.
  • Create and enhance monitoring, alerting, and logging systems to improve observability.
  • Build ad-hoc, reusable automation solutions where required.

Collaboration & Best Practices
  • Partner with engineering teams to integrate SRE principles into the software development lifecycle.
  • Advocate for best practices in incident response, post-mortem reviews, and capacity planning.
  • Share knowledge with team members and contribute to a culture of continuous improvement.

Security & Compliance
  • Implement security best practices for cloud and containerized environments.
  • Ensure compliance with organizational and industry standards.


Requirements

Technical Skills
  • Proven expertise in Linux system administration (e.g., Ubuntu, CentOS, or similar).
  • Deep understanding of AWS services and architecture (e.g., EC2, S3, RDS, VPC, IAM).
  • Strong experience managing Kubernetes clusters in production.
  • Hands-on experience with infrastructure-as-code tools like Terraform or CloudFormation.
  • Proficiency in scripting or programming languages (e.g., Python, Bash, or Go).
  • Demonstrated experience in application development for backend automation solutions.
  • 3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role at a SaaS or cloud-based company.

Operational Expertise
  • Familiarity with monitoring and logging tools such as Prometheus, Grafana, ELK, or Datadog.
  • Experience designing and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI, or CircleCI).
  • Understanding of networking concepts (e.g., DNS, load balancing, firewalls).

Problem Solving & Collaboration
  • Strong analytical and troubleshooting skills.
  • Ability to work effectively in a collaborative, team-oriented environment.
  • Excellent written and verbal communication skills.

Education

Bachelor's degree in Computer Science or Engineering, or equivalent work experience.

Nice-to-Have Skills:
  • Experience with service meshes and other CNCF technologies (e.g., Istio or Linkerd).
  • Knowledge of database systems (e.g., MySQL, PostgreSQL, or NoSQL databases).
  • Familiarity with cloud-native technologies and tools (e.g., Helm, ArgoCD, Spinnaker).

Benefits:
  • Hybrid work environment.
  • Competitive salary.
  • Health, dental, and vision insurance.
  • 401(k) plan.
  • Opportunities for professional development and growth.
  • Generous vacation policy.
