Sr. Site Reliability Engineer

Company: Addison Group

Location: Austin, TX 78745

Description:

In this position, you will be a vital member of our Site Reliability Engineering (SRE) team, responsible for improving incident response, advancing problem management, identifying automation opportunities, and managing observability tools. You'll work closely with Platform and Value Stream teams to strengthen system resiliency, champion a culture of Site Reliability Engineering, and support our transition from on-premises to cloud infrastructure.

Responsibilities & Qualifications

Ideal candidates will:
  • Lead positive change with clear, collaborative leadership and measurable project outcomes.
  • Solve challenges independently while offering solutions-focused guidance to peers.
  • Empower team growth by sharing knowledge transparently and providing constructive feedback.
  • Foster a culture of diversity of thought, mutual trust, and accountability.

What you'll do:
  • Take ownership of key projects, driving efforts to improve efficiency, enable self-service, and automate manual processes.
  • Manage initiatives from discovery through planning, scheduling, and execution using Agile Scrum methodologies.
  • Lead high-stakes production incidents as a Senior Incident Commander, ensuring rapid resolution, clear communication, and poise under pressure.
  • Facilitate post-incident retrospectives, transforming technical learnings into actionable improvements.
  • Architect, implement, and maintain cutting-edge observability systems to ensure proactive incident detection and resolution.
  • Build and manage integrations across systems to streamline monitoring, alerting, and health reporting.
  • Define and execute strategies for system availability, performance, and reliability, aligning with organizational goals.
  • Collaborate with stakeholders to establish Service Level Objectives (SLOs) and design strategies for managing breaches.
  • Mentor and guide team members, setting high standards for technical excellence and operational discipline.
  • Offer candid, constructive feedback to improve processes, systems, and team performance.
  • Serve as a trusted advisor, advocating for best practices in reliability engineering and driving cultural change across the organization.


It is required that you have:
  • Bachelor's degree in a related field or equivalent education, training, or experience.
  • At least 4 years of experience in site reliability engineering, DevOps, or a related engineering discipline (or equivalent education, training, or experience).
  • Strong leadership skills in incident management and operational excellence.
  • Demonstrated initiative, the ability to work independently, and a record of results-driven success.
  • Expertise in building and optimizing complex systems.

It would be great to also have:
  • Expertise in ITIL practices and their application in modern IT environments.
  • Extensive experience in operations and engineering with distributed systems.
  • Proficiency with Git and modern CI/CD pipelines.
  • Advanced skills in programming (Java, C#) and scripting (Python, PowerShell, Bash).
  • Hands-on experience with automation tools (Terraform, Ansible) and infrastructure as code.
  • Proven success in implementing monitoring, logging, and alerting solutions.
  • Exceptional collaboration, negotiation, and presentation skills, with the ability to inspire and influence.
  • Experience providing constructive feedback and fostering continuous improvement.
  • A passion for achieving results, with a strong sense of accountability and teamwork.
