Sr. SRE Consultant
Company: Cloud BC Labs
Location: Seattle, WA 98115
Description:
Role: Sr. SRE (Very Strong Technical SRE)
Location: Seattle, WA
WFO: Mandatory (3 days/week)
Short JD:
Job Summary/Role Description (General Information)
Core Skills/Technical Requirements
Soft Skills/Other Requirements
Key Responsibilities/Duties
- As a Senior Site Reliability Engineer, you will play a critical role in supporting application developers and operations personnel by providing expert guidance on application and infrastructure best practices from a reliability perspective.
- Your primary focus will be observability, toil reduction through automation, and improving reliability, with an emphasis on solving operational issues.
- Must have 5+ years of SRE experience in large programs with a focus on toil reduction, implementation of full-stack observability, and reduction of MTTD and MTTR.
- Must have a good understanding of Site Reliability Engineering (SRE) principles and practices.
- Should be a strong team player who enjoys collaborating with different teams, shares knowledge, and strives for continuous improvement of self and team.
- Experience scripting in at least one of the following: Python, PowerShell, Bash, Shell, Perl.
- Strong experience with one or more observability tools such as Splunk, AppDynamics, Dynatrace, or Datadog.
- Experience with observability dashboard creation, synthetic monitoring, and Real User Monitoring (RUM).
- Experience working with tools such as Remedy, ServiceNow, Confluence, and Jira.
- Experience with ITSM processes, including incident, problem, and change management.
- Experience setting up service maps/distributed traces in an observability tool (good to have).
- Knowledge of operating systems such as Linux and Windows, including an understanding of networking (good to have).
- Experience in software architecture, distributed systems, and development languages such as Java or .NET (good to have).
- Should possess strong analytical, troubleshooting, and problem-solving skills.
- Excellent communication and leadership skills.
- Drive the reliability and performance of the client's critical services.
- Drive system reliability and stability through proactive monitoring and automation.
- Implement observability frameworks and SRE best practices, including the setup of SLOs/SLIs.
- Define error budgets based on the SLOs.
- Foster a metrics-driven culture, using data to measure overall system quality and reliability.
- Provide primary operational support and engineering for the client's critical services.
- Manage and participate in on-call incidents.
- Work with users and the Ops team to understand issues, perform root cause analysis, and work with the development team on permanent fixes.
- Set up service maps/distributed traces to visualize the entire workflow and analyze the causes of problems and incidents.
- Define, evangelize, and maintain SRE best practices.
- Improve automation, including the system's self-healing capabilities.