SRE
Company: Witness AI
Location: La Canada Flintridge, CA 91011
Description:
Job Title: Site Reliability Engineer (SRE)
About Us: WitnessAI is a leader in providing innovative networking solutions designed to enhance security, performance, and reliability for businesses of all sizes. We are seeking a highly skilled Site Reliability Engineer (SRE) with a strong background in Linux administration, AWS, and Kubernetes. The ideal candidate will help ensure the reliability, scalability, and performance of our systems while driving a culture of automation and continuous improvement.
Key Responsibilities
System Reliability & Operations
- Maintain and improve the reliability, availability, and performance of our services and infrastructure.
- Monitor system health, troubleshoot issues, and respond to incidents with a focus on reducing mean time to recovery (MTTR).
Infrastructure Management
- Administer and optimize Linux-based systems across development, staging, and production environments.
- Design and manage scalable, secure, and cost-effective solutions on AWS.
- Build, maintain, and monitor Kubernetes clusters to support containerized applications.
Automation & Tooling
- Develop and maintain CI/CD pipelines to streamline deployments.
- Automate operational tasks using tools such as Terraform, Crossplane, or custom scripts.
- Create and enhance monitoring, alerting, and logging systems to improve observability.
- Build ad-hoc, reusable automation solutions where required.
Collaboration & Best Practices
- Partner with engineering teams to integrate SRE principles into the software development lifecycle.
- Advocate for best practices in incident response, post-mortem reviews, and capacity planning.
- Share knowledge with team members and contribute to a culture of continuous improvement.
Security & Compliance
- Implement security best practices for cloud and containerized environments.
- Ensure compliance with organizational and industry standards.
Requirements
Technical Skills
- Proven expertise in Linux system administration (e.g., Ubuntu, CentOS, or similar).
- Deep understanding of AWS services and architecture (e.g., EC2, S3, RDS, VPC, IAM).
- Strong experience managing Kubernetes clusters in production.
- Hands-on experience with infrastructure-as-code tools like Terraform or CloudFormation.
- Proficiency in scripting or programming languages (e.g., Python, Bash, or Go).
- Demonstrated experience in app development for backend automation solutions.
- 3+ years of experience as a Site Reliability Engineer, DevOps Engineer, or in a similar role at a SaaS or cloud-based company.
Operational Expertise
- Familiarity with monitoring and logging tools such as Prometheus, Grafana, ELK, or Datadog.
- Experience designing and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI, or CircleCI).
- Understanding of networking concepts (e.g., DNS, load balancing, firewalls).
Problem Solving & Collaboration
- Strong analytical and troubleshooting skills.
- Ability to work effectively in a collaborative, team-oriented environment.
- Excellent written and verbal communication skills.
Education
Bachelor's degree in Computer Science, Engineering, or equivalent work experience.
Nice-to-Have Skills:
- Experience with service meshes and other CNCF technologies (e.g., Istio or Linkerd).
- Knowledge of database systems (e.g., MySQL, PostgreSQL, or NoSQL databases).
- Familiarity with cloud-native technologies and tools (e.g., Helm, ArgoCD, Spinnaker).
Benefits:
- Hybrid work environment.
- Competitive salary.
- Health, dental, and vision insurance.
- 401(k) plan.
- Opportunities for professional development and growth.
- Generous vacation policy.