Monitoring Manager

Company: Scicom Infrastructure Services, Inc.

Location: Atlanta, GA 30349

Description:

Key Responsibilities:
  • Tool Management: Oversee the implementation and operation of observability and monitoring tools (e.g., AppDynamics, Splunk, OBM, Tivoli, OpenTelemetry, SolarWinds, xMatters, Prometheus, Grafana) to track metrics, logs, and traces.
  • System Health Monitoring: Develop strategies for real-time monitoring of systems to ensure high availability, performance, and fault tolerance.
  • Incident Management: Lead incident detection and resolution by ensuring timely alerts and diagnosis of system issues, minimizing downtime.
  • Data-Driven Insights: Use monitoring data to identify trends, anomalies, and potential bottlenecks in systems to optimize performance.
  • Collaboration: Work with DevOps, Site Reliability Engineering (SRE), and engineering teams to design monitoring strategies and ensure proper instrumentation of applications.
  • SLA and SLO Management: Ensure that Service Level Agreements (SLAs) and Service Level Objectives (SLOs) are met through proactive monitoring and reporting.
  • Process Improvement: Continuously evolve and refine observability processes and tools to meet the growing demands of the infrastructure.
  • Team Leadership: Manage a team of monitoring engineers, assign tasks, and oversee the execution of observability initiatives.
  • Automation: Drive automation in monitoring and alerting to reduce manual effort and improve the reliability of alerts.

Skills and Qualifications:
  • Expertise in monitoring tools and observability platforms (AppDynamics, Splunk, OBM, Tivoli, OpenTelemetry, SolarWinds, xMatters, Prometheus, Grafana, etc.).
  • Strong understanding of system architecture, cloud infrastructure (AWS, GCP, Azure), and containerization (Kubernetes, Docker).
  • Experience with incident management and troubleshooting in complex environments.
  • Leadership experience with cross-functional teams.
  • Proficiency in scripting languages such as Python or Bash for automation.
  • Knowledge of best practices in logging, monitoring, and distributed tracing.
  • Strong analytical skills with a focus on data interpretation for performance tuning.
