Site Reliability Engineer - Parsippany, NJ (Onsite)


Company: Brandon Consulting Associates

Location: Parsippany, NJ 07054

Description:

A Site Reliability Engineer (SRE) bridges the gap between software engineering and IT operations, focusing on building reliable, scalable, and efficient systems. The role was pioneered at Google and has since become common across tech and enterprise companies.

Key Responsibilities

1. System Reliability & Uptime

Ensure applications and infrastructure meet reliability goals (such as 99.9% uptime).

Set and track SLAs (Service Level Agreements), SLOs (Service Level Objectives), and SLIs (Service Level Indicators).

Investigate outages and performance issues and implement fixes to prevent recurrence.
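
For candidates less familiar with how an uptime target becomes something measurable, the short sketch below shows the arithmetic behind a goal like 99.9%. It is illustrative only: the function names and the 30-day window are assumptions made for the example, not a description of the tooling used in this role.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability target allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_percent / 100)


def budget_remaining_fraction(slo_percent: float, window_days: int,
                              downtime_minutes: float) -> float:
    """Fraction of the downtime budget still unspent (negative if overspent)."""
    budget = downtime_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical example: a 99.9% target over an assumed 30-day window.
    budget = downtime_budget_minutes(99.9, 30)
    print(f"Allowed downtime: {budget:.1f} minutes")
    # Assume 10.8 minutes of downtime have already occurred in the window.
    left = budget_remaining_fraction(99.9, 30, 10.8)
    print(f"Budget remaining: {left:.0%}")
```

Under these assumptions, a 99.9% target leaves roughly 43 minutes of downtime per 30 days, which is why tracking it as a budget is more practical than treating it as a pass/fail number.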

2. Automation & Tooling

Automate repetitive operational tasks (such as deployments, monitoring setup, and scaling).

Write scripts and code (often in Python, Bash, or Go) to improve processes.

Build CI/CD pipelines to automate code deployment.

3. Monitoring & Observability

Set up monitoring tools (such as Datadog, Prometheus, or Dynatrace) to track system health.

Develop dashboards and alerts to detect problems before they impact users.

Analyze logs, traces, and metrics to uncover performance bottlenecks.

4. Incident Management

Act as a first responder during production incidents.

Lead post-mortems to find root causes and prevent future issues.

Collaborate with development and infrastructure teams to resolve problems quickly.

5. Performance Optimization

Continuously assess system and application performance.

Identify ways to improve latency, throughput, and resource utilization.

Conduct load testing and chaos engineering to simulate failures.

6. Infrastructure Management

Provision and manage cloud infrastructure (AWS, Azure, GCP) using Infrastructure as Code (IaC) tools like Terraform.

Ensure systems are secure, cost-efficient, and scalable.

7. Collaboration

Work closely with development, QA, security, and product teams to embed reliability into application design.

Educate teams on best practices for reliability and operational excellence.

Skills Needed

Coding & scripting (Python, Bash, etc.)

Cloud platforms (AWS, Azure, GCP)

Monitoring & Observability (Prometheus, Datadog, Dynatrace)

CI/CD tools (Jenkins, GitHub Actions, ArgoCD)

Incident response & troubleshooting

Strong knowledge of Linux systems

Networking basics (DNS, load balancing, etc.)

Communication & documentation skills

Overall Goal

The goal of an SRE is to ensure that systems are reliable, scalable, and maintainable, with minimal manual intervention.
