Monitoring Manager
Company: Scicom Infrastructure Services, Inc.
Location: Atlanta, GA 30349
Description:
Key Responsibilities:
- Tool Management: Oversee the implementation and operation of observability and monitoring tools (e.g., AppDynamics, Splunk, OBM, Tivoli, OpenTelemetry, SolarWinds, xMatters, Prometheus, Grafana, etc.) to track metrics, logs, and traces.
- System Health Monitoring: Develop strategies for real-time monitoring of systems to ensure high availability, performance, and fault tolerance.
- Incident Management: Lead incident detection and resolution by ensuring timely alerts and diagnosis of system issues, minimizing downtime.
- Data-Driven Insights: Use monitoring data to identify trends, anomalies, and potential bottlenecks in systems to optimize performance.
- Collaboration: Work with DevOps, SRE (Site Reliability Engineering), and engineering teams to design monitoring strategies and ensure proper instrumentation of applications.
- SLAs & SLOs Management: Ensure that Service Level Agreements (SLAs) and Service Level Objectives (SLOs) are met through proactive monitoring and reporting.
- Process Improvement: Continuously evolve and refine observability processes and tools to meet the growing demands of the infrastructure.
- Team Leadership: Manage a team of monitoring engineers, assign tasks, and oversee the execution of observability initiatives.
- Automation: Drive automation in monitoring and alerting to reduce manual effort and improve the reliability of alerts.
Skills and Qualifications:
- Expertise in monitoring tools and observability platforms (AppDynamics, Splunk, OBM, Tivoli, OpenTelemetry, SolarWinds, xMatters, Prometheus, Grafana, etc.).
- Strong understanding of system architecture, cloud infrastructure (AWS, GCP, Azure), and containerization (Kubernetes, Docker).
- Experience with incident management and troubleshooting in complex environments.
- Leadership experience with cross-functional teams.
- Proficiency in scripting languages like Python, Bash, or similar for automation.
- Knowledge of best practices in logging, monitoring, and distributed tracing.
- Strong analytical skills with a focus on data interpretation for performance tuning.