Site Reliability Engineer
Company: Cloud BC Labs
Location: Reston, VA 20191
Description:
Position 2: Site Reliability Engineer
Location: Reston, VA (Hybrid onsite)
Term: C2C/W2
Duration: 1 year with possible extension
About the Role:
We are seeking a Site Reliability Engineer (SRE) to join our Enterprise Technology Operations (ETO) team. This role follows the YBYO (You Build, You Operate) model, meaning you will be responsible for both building and supporting the product. The ideal candidate has a strong problem-solving mindset, experience in incident management, and expertise in AWS infrastructure.
Key Responsibilities:
Design, develop, and maintain automation solutions to enhance system reliability and efficiency.
Implement monitoring and observability solutions to ensure visibility into system performance and health.
Apply SRE principles and best practices, automating repetitive tasks to improve system resilience.
Manage incident response and troubleshooting, ensuring rapid issue resolution.
Optimize and support AWS services, including EC2, Lambda, ECS, Batch, S3, RDS, and CloudWatch.
Conduct resiliency testing, identifying and mitigating potential system weaknesses.
Provide after-hours support during critical incidents.
Collaborate with cross-functional teams and stakeholders to ensure seamless system operations.
Mentor junior team members and drive continuous improvement initiatives.
Adapt to new tools and methodologies, staying ahead in the evolving cloud landscape.
Required Skills & Experience:
Strong SRE experience in AWS with hands-on expertise in cloud infrastructure.
Proficiency in automation, monitoring & observability, and resiliency testing.
Experience with AWS services such as EC2, Lambda, ECS, Batch, S3, RDS, and CloudWatch.
Knowledge of incident management and ability to troubleshoot complex system issues.
Experience in automating repetitive tasks to improve system reliability.
Strong communication skills to interact with portfolio teams and stakeholders.
Ability to take ownership, manage multiple projects, and meet deadlines.
Nice-to-Have Skills:
Exposure to chaos engineering.
Familiarity with new tools and methodologies in cloud and infrastructure automation.
Experience mentoring junior team members and contributing to a culture of continuous improvement.