Site Reliability Engineer - Parsippany, NJ (Onsite)
Company: Brandon Consulting Associates
Location: Parsippany, NJ 07054
Description:
A Site Reliability Engineer (SRE) bridges the gap between software engineering and IT operations, focusing on building reliable, scalable, and efficient systems. The role was pioneered at Google and has since become common across tech and enterprise companies.
Key Responsibilities
1. System Reliability & Uptime
Ensure applications and infrastructure meet reliability goals (like 99.9% uptime).
Set and track SLAs (Service Level Agreements), SLOs (Service Level Objectives), and SLIs (Service Level Indicators).
Investigate outages and performance issues and implement fixes to prevent recurrence.
2. Automation & Tooling
Automate repetitive operational tasks (like deployments, monitoring setups, and scaling).
Write scripts and code (often in Python, Bash, or Go) to improve processes.
Build CI/CD pipelines to automate code deployment.
3. Monitoring & Observability
Set up monitoring tools (such as Datadog, Prometheus, and Dynatrace) to track system health.
Develop dashboards and alerts to detect problems before they impact users.
Analyze logs, traces, and metrics to uncover performance bottlenecks.
4. Incident Management
Act as a first responder during production incidents.
Lead post-mortems to find root causes and prevent future issues.
Collaborate with development and infrastructure teams to resolve problems quickly.
5. Performance Optimization
Continuously assess system and application performance.
Identify ways to improve latency, throughput, and resource utilization.
Conduct load testing and chaos engineering to simulate failures.
6. Infrastructure Management
Provision and manage cloud infrastructure (AWS, Azure, GCP) using Infrastructure as Code (IaC) tools like Terraform.
Ensure systems are secure, cost-efficient, and scalable.
7. Collaboration
Work closely with developers, QA, security teams, and product teams to embed reliability into application design.
Educate teams on best practices for reliability and operational excellence.
Skills Needed
Coding & scripting (Python, Bash, etc.)
Cloud platforms (AWS, Azure, GCP)
Monitoring & Observability (Prometheus, Datadog, Dynatrace)
CI/CD tools (Jenkins, GitHub Actions, ArgoCD)
Incident response & troubleshooting
Strong knowledge of Linux systems
Networking basics (DNS, Load Balancing, etc.)
Communication & documentation skills
Overall Goal
The goal of an SRE is to ensure that systems are reliable, scalable, and maintainable, with minimal manual intervention.