Sr. SRE Consultant

Company: Cloud BC Labs

Location: Seattle, WA 98115

Description:

Role: Sr. SRE (Very Strong Technical SRE)

Location: Seattle, WA

WFO: Mandatory (3 days/week)

Short JD:

Job Summary/Role Description (General Information)

As a Senior Site Reliability Engineer, you will play a critical role in supporting application developers and operations personnel by providing expert guidance on application and infrastructure best practices from a reliability perspective.
Your primary focus will be observability, toil reduction through automation, and improving reliability, with an emphasis on solving operational issues.
Must have at least 5 years of SRE experience in large programs, with a focus on toil reduction, implementation of full-stack observability, and reduction of MTTD and MTTR.
Must have a good understanding of Site Reliability Engineering (SRE) principles and practices.
Should be a strong team player who enjoys collaborating with different teams, shares knowledge, and strives for continuous improvement of self and team.

Core Skills/Technical Requirements

Scripting experience in at least one of the following: Python, PowerShell, Bash or another shell, Perl.
Strong experience with one or more observability tools such as Splunk, AppDynamics, Dynatrace, or Datadog.
Experience with observability dashboard creation, synthetic monitoring, and real user monitoring (RUM).
Experience working with tools such as Remedy, ServiceNow, Confluence, and Jira.
Experience with ITSM processes, including incident, problem, and change management.
Experience setting up service maps and distributed traces in an observability tool (good to have).
Knowledge of operating systems such as Linux and Windows, including an understanding of networking (good to have).
Experience with software architecture, distributed systems, and development languages such as Java or .NET (good to have).

Soft Skills/Other Requirements

Key Responsibilities/Duties

Drive the reliability and performance of the client's critical services.
Drive system reliability and stability through proactive monitoring and automation.
Implement observability frameworks and SRE best practices, including setting up SLOs and SLIs.
Define error budgets based on the SLOs.
Foster a metrics-driven culture, using data to measure overall system quality and reliability.
Provide primary operational support and engineering for the client's critical services.
Manage and participate in on-call incidents.
Work with users and the Ops team to understand issues, perform root cause analysis, and work with the development team on permanent fixes.
Set up service maps and distributed traces to visualize the entire workflow and analyze the causes of problems and incidents.
Define, evangelize, and maintain SRE best practices.
Improve automation, including the systems' self-healing capabilities.