Sr. SRE Consultant

Company: Cloud BC Labs

Location: Seattle, WA 98115

Description:

Role: Sr. SRE (Very Strong Technical SRE)

Location: Seattle, WA

WFO: Mandatory (3 days/week)

Short JD:
  • Job Summary/Role Description (General Information)
    • As a Senior Site Reliability Engineer, you will play a critical role in supporting application developers and operations personnel by providing expert guidance on application and infrastructure best practices from a reliability perspective.
    • Your primary focus will be observability, toil reduction through automation, and improving reliability, with an emphasis on solving operational issues.
    • Must have at least 5 years of SRE experience in large programs with a focus on toil reduction, implementation of full-stack observability, and reduction of mean time to detect (MTTD) and mean time to resolve (MTTR).
    • Must have a good understanding of Site Reliability Engineering (SRE) principles and practices.
    • Should be a strong team player who enjoys collaborating with different teams, shares knowledge, and strives for continuous improvement of both self and team.


  • Core Skills/Technical Requirements
    • Experience with scripting in Python, PowerShell, Bash, Shell, Perl (any one of these).
    • Strong experience with one or more observability tools such as Splunk, AppDynamics, Dynatrace, or Datadog.
    • Experience in Observability Dashboard creation, Synthetic Monitoring, and Real User Monitoring (RUM).
    • Experience working with tools like Remedy, ServiceNow, Confluence, and Jira.
    • Experience in ITSM processes, including Incident, Problem, and Change management.
    • Experience in setting up service maps/distributed traces in the observability tool (good to have).
    • Knowledge of operating systems like Linux/Windows, including an understanding of networking (good to have).
    • Experience in software architecture, distributed systems, and development languages like Java or .NET (good to have).


  • Soft Skills/Other Requirements
    • Should possess strong analytical, troubleshooting, and problem-solving skills.
    • Excellent communication and leadership skills.


  • Key Responsibilities/Duties
    • Drive the reliability and performance of the client's critical services.
    • Drive system reliability and stability through proactive monitoring and automation.
    • Implement observability frameworks and SRE best practices, including the setup of SLOs and SLIs.
    • Define error budgets based on the SLOs.
    • Drive a metrics-driven culture using data to measure overall system quality and reliability.
    • Provide primary operational support and engineering for the client's critical services.
    • Manage and participate in on-call incidents.
    • Work with users and the Ops team to understand issues, perform root cause analysis, and work with the development team on permanent fixes.
    • Set up service maps/distributed traces to visualize the entire workflow and analyze the cause of problems and incidents.
    • Define, evangelize, and maintain SRE best practices.
    • Improve automation, including the system's self-healing capabilities.
