SRE Production Support Manager (PL)
Company: Charles Schwab
Location: Phoenix, AZ 85032
Description:
Your Opportunity
At Schwab, you are empowered to make an impact on your career. Here, innovative thought meets creative problem solving, helping us "challenge the status quo" and transform the finance industry together.
Core Brokerage Solutions Engineer (CBSE Save) seeks a Site Reliability Engineering (SRE) Manager who will be responsible for overseeing the reliability, performance, and availability of applications, infrastructure and services.
The role primarily involves:
- Team Leadership: Managing a team of Site Reliability Engineers and/or DevOps engineers, overseeing the offshore support resources, providing guidance, mentorship, and support to ensure the team's effectiveness and growth.
- Strategy Development: Developing strategies and roadmaps to improve the reliability, scalability, and performance of systems and applications.
- Cross-functional Collaboration: Collaborating with software engineering, operations, and other teams to design, implement, and maintain reliable and scalable systems.
This is a leadership role with both technical and people leadership responsibilities. As such, this role participates in short- and long-term systems, team, and organizational planning. This position reports directly to the Director.
Responsibilities also include, but are not limited to:
- Participate actively in all aspects of Site Reliability Engineering, including technical vision, telemetry and observation decisions, automation strategy, solution delivery, and platform incident and problem management.
- Fulfill the role of Escalation Manager/Critical Incident Manager on major incidents, facilitating incident resolution and leading teams through effective service restoration.
- Provide advanced Incident Management and Problem Management support to teams, to effectively identify, remediate, and resolve issues related to platform reliability, stability, and performance through careful analysis of telemetry data and system logs.
- Document all changes following controls, procedures, and documentation standards, and raise issues and concerns with recommendations for follow-up action.
- Deploy the necessary resources to support the business and ensure the correct staffing levels are in place to support the department's workload.
- Work with the other Engineering departments to ensure lessons learned and product improvement ideas are implemented in new system designs.
- Work closely with business aligned SRE leads to develop short and long-term strategies.
- Manage stakeholder expectations, resolve conflicts, and keep everyone involved aligned.
- Develop and drive execution on 6-month and 1-year roadmaps.
- Drive innovation and establish new approaches to improving productivity.
- Establish a metrics-based organization, develop key operational metrics and push for continuous improvement.
What you have
Required Qualifications:
- Leadership: Excellent leadership and communication skills, with the ability to inspire and motivate cross-functional teams.
- Problem-solving: Ability to analyze complex systems, troubleshoot issues, and devise effective solutions under pressure.
- Project Management: Proficiency in project management methodologies to effectively plan, execute, and track projects.
- Stakeholder Management: Ability to understand and address the needs of various stakeholders, including engineers, product managers, and business partners.
- Technical Proficiency: Strong understanding of cloud computing, networking, Linux systems administration, containerization (e.g., Docker, Kubernetes), and infrastructure as code.
- Process Improvement: Proven experience in streamlining workflows, identifying bottlenecks, and implementing process enhancements to optimize efficiency and productivity.
- Strong understanding of SRE principles, DevOps practices, and relevant technologies to effectively guide and support engineering teams.
- 5+ years of experience working in organizations, with the ability to communicate effectively with executives, leaders, and individual contributors across the organization.
- 5+ years of SRE experience working with telemetry, observation, self-healing solutions, and platform automation.
- Advanced experience in the use of the following platforms and tools:
  - Cloud: MS Azure/AWS Cloud
  - Networking fundamentals: TCP/IP, DNS, WINS, DHCP, etc.
  - Collaboration & Change Management tools: Jira, ServiceNow, Cherwell, etc.
  - Databases: Oracle, MS SQL, PostgreSQL, DB2, Mongo, etc.
- Bachelor's degree in computer science or equivalent
In addition to the salary range, this role is also eligible for bonus or incentive opportunities.