Cloud Native (Azure) SRE
Company: Omni Inclusive
Location: Dallas, TX 75217
Description:
Cloud Native (Azure) SRE L2 Engineer

Job Summary:
We are looking for an SRE L2 Engineer to support and maintain our Azure cloud-native infrastructure, ensuring high availability, performance, and security. The ideal candidate will be responsible for monitoring, troubleshooting, incident resolution, and automation of cloud services, working closely with L3 and engineering teams to improve system reliability.

Key Responsibilities:

Incident & Problem Management:
o Monitor and troubleshoot Azure cloud services, Kubernetes clusters, and Linux-based workloads.
o Provide L2 support for production incidents, perform root cause analysis (RCA), and escalate to L3 when needed.
o Respond to and resolve alerts using Azure Monitor, Prometheus, Grafana, and Splunk.

Automation & Infrastructure Management:
o Assist in managing Infrastructure as Code (Terraform, Bicep, ARM Templates) for deployments.
o Develop and maintain automation scripts in Bash, PowerShell, or Python for operational tasks.

Performance Monitoring & Optimization:
o Monitor system health, logs, and performance metrics, ensuring optimal uptime.
o Optimize resources and support cost-saving initiatives in Azure.

Security & Compliance:
o Follow best practices for security hardening, patching, and vulnerability management.
o Support compliance efforts (ISO27001, PCI DSS, GDPR, HIPAA).

Collaboration & Documentation:
o Work closely with L3, DevOps, and engineering teams to implement fixes and improvements.
o Document SOPs, incident reports, and troubleshooting guides for knowledge sharing.

Required Qualifications & Skills:

Technical Expertise:
o 3+ years of experience in SRE, DevOps, or Linux Administration.
o Strong knowledge of Azure cloud services (VMs, VNET, NSG, Load Balancers, Storage).
o Experience with monitoring tools (Azure Monitor, Grafana, Prometheus, Splunk, ELK).
o Hands-on experience with Linux system administration, troubleshooting, and automation.
o Familiarity with Kubernetes (AKS), Docker, and containerized workloads.
o Scripting knowledge in Bash, PowerShell, or Python.

Operational Excellence:
o Experience working with incident management tools (ServiceNow, Jira, PagerDuty).
o Knowledge of ITIL processes for incident, problem, and change management.

Security & Compliance:
o Understanding of IAM, RBAC, Key Vault, and security policies in Azure.
o Basic knowledge of firewalls, networking, and encryption mechanisms.

Preferred Skills (Good to Have):
o Knowledge of Ansible, Puppet, or Chef for configuration management.
o Experience with Azure DevOps, Jenkins, or GitOps workflows.
o Exposure to multi-cloud environments (AWS, GCP).
o Understanding of VoIP, SIP, RTP, and telephony technologies.

(1.) Provide support for on-call escalations and perform root cause analysis of the given issue.
(2.) Independently resolve tickets within the agreed SLA for ticket volume and time.
(3.) Adhere to quality standards, regulatory requirements, and company policies.
(4.) Work on value-adding activities such as knowledge base updates and management, training freshers, and coaching analysts.
(5.) Ensure a positive customer experience and CSAT through First Call Resolution and minimal rejected resolutions and reopened cases.