SRE Technical Lead

Company: RICEFW Technologies, Inc.

Location: Philadelphia, PA 19120

Description:

Responsibilities:
  • Observability and Monitoring:
    • Develop and implement robust observability strategies, including logging, metrics, and tracing, to gain deep insights into the performance and health of our systems.
    • Collaborate with cross-functional teams to establish and enforce best practices for instrumentation, logging, and monitoring throughout the software development lifecycle.
  • Site Reliability Engineering:
    • Lead initiatives to improve the reliability, availability, and scalability of our applications and infrastructure.
    • Collaborate with development teams to design and implement systems that are resilient to failures and capable of quick recovery.
    • Drive the adoption of SRE principles and practices across the organization.
  • Incident Management:
    • Develop and refine incident response processes, ensuring timely detection, analysis, and resolution of incidents.
    • Collaborate with teams to conduct post-incident reviews, identify root causes, and implement preventive measures.
  • Automation and Tooling:
    • Build and maintain automation tools for deployment, monitoring, and incident response to streamline operational processes.
    • Evaluate and integrate third-party tools to enhance observability and SRE capabilities.
  • Collaboration and Leadership:
    • Provide technical leadership and mentorship to the engineering team.
    • Collaborate with product managers, architects, and other stakeholders to align observability and SRE initiatives with business goals.

Qualifications:
    • Bachelor's or higher degree in Computer Science, Software Engineering, or a related field.
    • Extensive experience in software engineering with a focus on observability, monitoring, and SRE.
    • Strong expertise in designing and implementing distributed systems for high availability and reliability.
    • Proficiency in application performance monitoring (APM), real user monitoring (RUM), synthetic monitoring, correlation, and alert and incident management (e.g., OpenTelemetry, Jaeger, Kloudfuse, ServiceNow).
    • Proficiency in one or more programming languages (e.g., Java, Python, Go).
    • Experience with cloud platforms (e.g., AWS, Azure, GCP) and container orchestration (e.g., Kubernetes).
    • In-depth knowledge of observability tools and frameworks (e.g., Prometheus, Grafana, ELK stack, Datadog, Aternity) and incident management processes.
    • In-depth knowledge of ML and AI techniques for operations (e.g., anomaly detection, outlier detection, AIOps, LLMs).
    • Excellent communication and collaboration skills.
    • Demonstrated ability to lead technical initiatives and mentor team members.

Preferred Qualifications:
    • Certifications in relevant areas such as AWS Certified DevOps Engineer, Certified Kubernetes Administrator (CKA), or equivalent.
    • Previous experience in a leadership or management role.
    • Familiarity with Infrastructure as Code (IaC) tools such as Terraform, Packer, and Crossplane.
