Site Reliability Engineer

Company: My3Tech

Location: Richmond, CA 94804

Description:

Role and Responsibilities

Reporting to the Head of Cloud/API Engineering, the Cloud Reliability Engineer will play a critical role in driving innovation and growth for the Banking Solutions business. In this role, the candidate will have the opportunity to make a lasting impact on the company's digital transformation journey, drive customer-centric innovation and automation, and position the organization as a leader in the competitive digital banking landscape. Specifically, the Cloud Reliability Engineer will be responsible for the following:

Strategize and drive the building blocks of reliability engineering as we make the transition from private to public cloud.

Ensure the reliability, availability, and performance of applications and services, focusing on minimizing downtime, optimizing response times, and maintaining high availability for users.

Lead incident response, including identification, triage, resolution, and post-incident analysis, to prevent recurrence and improve system resilience.

Develop and maintain monitoring solutions and alerting mechanisms for infrastructure, application performance, and user experience metrics, enabling proactive issue detection and mitigation.

Implement automation tools and processes to automate routine tasks, scale infrastructure, and ensure seamless deployments, updates, and rollbacks with minimal user impact.

Conduct capacity planning, performance tuning, and resource optimization across environments, collaborating with development and operations teams to meet scalability and performance goals.

Collaborate with security teams to implement security best practices, perform vulnerability assessments, and ensure compliance with security standards and regulatory requirements for applications.

Manage deployment pipelines, release processes, and configuration management for app deployments, ensuring consistency, reliability, and version control across environments.

Identify areas for improvement in reliability, performance, and efficiency through data analysis, root cause analysis, and trend analysis, and drive initiatives to enhance system reliability and operational efficiency.

Create and maintain documentation, runbooks, and knowledge base articles for operational procedures, troubleshooting guides, and best practices, and promote knowledge sharing within the team.

Develop and test disaster recovery plans, backup strategies, and failover mechanisms for app services, ensuring business continuity and data integrity in case of failures or disasters.

Collaborate with development, QA, DevOps, and product teams to ensure alignment on reliability goals, performance metrics, release schedules, and incident response processes.

Participate in on-call rotations and provide 24/7 support for critical incidents, troubleshoot issues, and coordinate with teams for resolution, escalation, and follow-up actions as per defined SLAs.

Required Skills :

1. Role and Responsibilities:
- Kostas is leading the Digital Cloud Enablement (DCE) team and the Site Reliability Engineering (SRE) team for digital.
- The focus is on revamping the observability platform to create a premium package for ease of use by engineers.

2. Observability Platform:
- The goal is to build a user-friendly platform that provides automatic dashboards and monitoring upon deployment.
- This platform aims to simplify the onboarding process for engineers, making it easier for them to take responsibility for their platforms.

3. Collaboration with CIO SRE Team:
- Kostas is working with the broader CIO SRE team to create a flexible package that can be utilized across the organization.
- The package will use OpenTelemetry collectors to gather data, which can then be routed to different monitoring platforms like Prometheus or integrated into the new system being built.

4. Current Tools and Utilization:
- Existing tools include Dynatrace, Splunk, and xMatters, but they are underutilized and used differently across various value streams.
- The aim is to standardize methods within the digital value stream to improve efficiency and visibility.

5. Staffing and Team Structure:
- The SRE team has four members with two open positions (one contractor and one full-time).
- The DCE team has five members with one open contractor position.
- There have been delays in filling these positions, potentially pushing new hires to the next year.

6. Impact on Business and Clients:
- The initiatives will improve deployment speed, platform stability, and performance, ultimately enhancing the customer experience.
- The focus is on availability and uptime, ensuring reliable and efficient service delivery.

7. Infrastructure Modernization:
- The DCE team is also working on modernizing the underlying infrastructure, following a new target architecture. (Corey Terrell leading this)
- This includes redistributing accounts, networking, and revisiting security measures.
- The goal is to manage everything through GitHub, automating releases and reducing manual intervention.

8. Current State vs. Ideal State:
- Currently, the infrastructure includes public clouds (Azure and AWS) and a private cloud hosted in multiple data centers.
- The aim is to modernize the AWS public cloud environment, decomposing monolithic applications into microservices and standardizing deployment methods.

Basic Qualification :

Additional Skills :

Background Check : Yes

Drug Screen : Yes

Notes :
Selling points for candidate :
Project Verification Info : MSA: Blanket Approval Received; Client Letter: Will Not Provide
Candidate must be your W2 Employee : Yes
Exclusive to Apex : No
Face to face interview required : No
Candidate must be local : No
Candidate must be authorized to work without sponsorship : No
Interview times set : No
