SRE Technical Lead
Company: RICEFW Technologies, Inc.
Location: Philadelphia, PA 19120
Description:
Responsibilities:
Observability and Monitoring:
- Develop and implement robust observability strategies, including logging, metrics, and tracing, to gain deep insights into the performance and health of our systems.
- Collaborate with cross-functional teams to establish and enforce best practices for instrumentation, logging, and monitoring throughout the software development lifecycle.
Site Reliability Engineering:
- Lead initiatives to improve the reliability, availability, and scalability of our applications and infrastructure.
- Collaborate with development teams to design and implement systems that are resilient to failures and capable of quick recovery.
- Drive the adoption of SRE principles and practices across the organization.
Incident Management:
- Develop and refine incident response processes, ensuring timely detection, analysis, and resolution of incidents.
- Collaborate with teams to conduct post-incident reviews, identify root causes, and implement preventive measures.
Automation and Tooling:
- Build and maintain automation tools for deployment, monitoring, and incident response to streamline operational processes.
- Evaluate and integrate third-party tools to enhance observability and SRE capabilities.
Collaboration and Leadership:
- Provide technical leadership and mentorship to the engineering team.
- Collaborate with product managers, architects, and other stakeholders to align observability and SRE initiatives with business goals.
Qualifications:
- Bachelor's or higher degree in Computer Science, Software Engineering, or a related field.
- Extensive experience in software engineering with a focus on observability, monitoring, and SRE.
- Strong expertise in designing and implementing distributed systems for high availability and reliability.
- Proficiency in application performance monitoring (APM), real user monitoring (RUM), synthetic monitoring, event correlation, and alert and incident management (e.g., OpenTelemetry, Jaeger, Kloudfuse, ServiceNow).
- Proficiency in one or more programming languages (e.g., Java, Python, Go).
- Experience with cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
- In-depth knowledge of observability tools and frameworks (e.g., Prometheus, Grafana, ELK stack, Datadog, Aternity) and incident management processes.
- In-depth knowledge of ML and AI techniques for operations (e.g., anomaly detection, outlier detection, AIOps, LLMs).
- Excellent communication and collaboration skills.
- Demonstrated ability to lead technical initiatives and mentor team members.
Preferred Qualifications:
- Certifications in relevant areas such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), or equivalent.
- Previous experience in a leadership or management role.
- Familiarity with Infrastructure as Code (IaC) tools such as Terraform, Packer, and Crossplane.