SRE with strong Python Automation experience

Company: Clifyx

Location: Dallas, TX 75217

Description:

Key Responsibilities

Develop Python-based automation solutions to streamline the management of on-prem and cloud infrastructure on GCP and Kubernetes.
Continuously identify and implement opportunities to enhance operational excellence.
Build proactive and innovative solutions that can scale.
Implement and manage configuration automation using Ansible (desirable).
Integrate various tools and services via APIs and client libraries, enabling seamless interoperability across systems.
Enhance deployment reliability by implementing automated chaos engineering strategies, failover mechanisms, and self-healing infrastructure.
Develop proactive monitoring and alerting solutions using tools like Splunk, GCP Operations Suite, Grafana, and Prometheus.
Perform deep root cause analysis (RCA) and incident management for complex system failures, and develop automation to prevent recurrence.
Work on system resilience and performance tuning, ensuring mission-critical applications run efficiently under high loads.
Apply AI/ML techniques to automation workflows, enhancing anomaly detection, predictive scaling, and intelligent alerting.
Identify and develop AIOps opportunities, reducing operational overhead through intelligent automation.
Experiment with machine learning models to optimize log analysis, monitoring insights, and failure predictions.

Required Skills & Experience

Strong background in Systems Engineering with a focus on automation and reliability.
Proficiency in Python (intermediate to expert level) for developing automation and integrations.
Hands-on expertise with Kubernetes and cloud platforms (GCP or any major cloud).
Experience integrating various tools and platforms via APIs and client libraries.
Deep understanding of monitoring and alerting using Splunk, GCP Operations Suite, Grafana, and Prometheus.
Ability to work in fast-paced, high-stakes environments where reliability and uptime are critical.
Strong problem-solving skills, capable of navigating uncertainty and handling complex challenges.
Experience with Ansible for infrastructure automation (desirable).
Prior experience working in mission-critical teams handling large-scale, high-availability systems is a plus.
Enthusiasm for AI/ML and AIOps, with a desire to apply them in automation and operations.