SRE with strong Python Automation experience
Company: Clifyx
Location: Dallas, TX 75217
Description:
Key Responsibilities
- Develop Python-based automation solutions to streamline on-prem and cloud infrastructure management on GCP and Kubernetes.
- Continuously identify and implement opportunities to improve operational excellence.
- Build proactive and innovative solutions that can scale.
- Implement and manage configuration automation using Ansible (desirable).
- Integrate various tools and services via APIs and client libraries, enabling seamless interoperability across systems.
- Enhance deployment reliability by implementing automated chaos strategies, failover mechanisms, and self-healing infrastructure.
- Develop proactive monitoring and alerting solutions using tools like Splunk, GCP Operations Suite, Grafana, and Prometheus.
- Perform deep root cause analysis (RCA) and incident management for complex system failures, and develop automation to prevent recurrence.
- Work on system resilience and performance tuning, ensuring mission-critical applications run efficiently under high loads.
- Apply AI/ML techniques to automation workflows, enhancing anomaly detection, predictive scaling, and intelligent alerting.
- Identify and develop AIOps opportunities, reducing operational overhead through intelligent automation.
- Experiment with machine learning models to optimize log analysis, monitoring insights, and failure predictions.
Required Skills & Experience
- Strong background in Systems Engineering with a focus on automation and reliability.
- Proficiency in Python (intermediate to expert level) for developing automation and integrations.
- Hands-on expertise with Kubernetes and cloud platforms (GCP or any major cloud).
- Experience integrating various tools and platforms via APIs and client libraries.
- Deep understanding of monitoring and alerting using Splunk, GCP Operations Suite, Grafana, and Prometheus.
- Ability to work in aggressive, high-stakes environments where reliability and uptime are critical.
- Strong problem-solving skills, capable of navigating uncertainty and handling complex challenges.
- Experience with Ansible for infrastructure automation (desirable).
- Prior experience working in mission-critical teams handling large-scale, high-availability systems is a plus.
- Enthusiasm for AI/ML and AIOps, with a desire to apply it in automation and operations.