SRE with strong Python Automation experience

Company: Clifyx

Location: Dallas, TX 75217

Description:

Key Responsibilities
  • Develop Python-based automation solutions to streamline on-prem and cloud infrastructure management on GCP and Kubernetes.
  • Continuously identify and implement opportunities to improve operational excellence.
  • Build proactive and innovative solutions that can scale.
  • Implement and manage configuration automation using Ansible (desirable).
  • Integrate various tools and services via APIs and client libraries, enabling seamless interoperability across systems.
  • Enhance deployment reliability by implementing automated chaos strategies, failover mechanisms, and self-healing infrastructure.
  • Develop proactive monitoring and alerting solutions using tools like Splunk, GCP Operations Suite, Grafana, and Prometheus.
  • Perform deep root cause analysis (RCA) and incident management for complex system failures, and develop automation to prevent recurrence.
  • Work on system resilience and performance tuning, ensuring mission-critical applications run efficiently under high loads.
  • Apply AI/ML techniques to automation workflows, enhancing anomaly detection, predictive scaling, and intelligent alerting.
  • Identify and develop AIOps opportunities, reducing operational overhead through intelligent automation.
  • Experiment with machine learning models to optimize log analysis, monitoring insights, and failure predictions.

Required Skills & Experience
  • Strong background in Systems Engineering with a focus on automation and reliability.
  • Proficiency in Python (intermediate to expert level) for developing automation and integrations.
  • Hands-on expertise with Kubernetes and cloud platforms (GCP or any major cloud).
  • Experience integrating various tools and platforms via APIs and client libraries.
  • Deep understanding of monitoring and alerting using Splunk, GCP Operations Suite, Grafana, and Prometheus.
  • Ability to work in aggressive, high-stakes environments where reliability and uptime are critical.
  • Strong problem-solving skills, capable of navigating uncertainty and handling complex challenges.
  • Experience with Ansible for infrastructure automation (desirable).
  • Prior experience working in mission-critical teams handling large-scale, high-availability systems is a plus.
  • Enthusiasm for AI/ML and AIOps, with a desire to apply them in automation and operations.
