Senior Site Reliability and Operations Engineer

Company: Lorven Technologies Inc

Location: Irving, TX 75061

Description:

Role: Senior Site Reliability and Operations Engineer (SRE)

Location: Irving, TX - Hybrid position - Only Local Consultants

Job description:

We are looking for a highly skilled Senior Site Reliability and Operations Engineer (SRE) with extensive experience implementing Kubernetes-based distributed caching and compute grid solutions. This role requires a strong foundation in software development, infrastructure automation, reliability engineering, and large-scale enterprise implementations. The candidate will be responsible for designing, implementing, and maintaining high-performance distributed systems and for ensuring their reliability, scalability, and efficiency.

Development & Implementation:
Design, develop, and optimize distributed caching and compute grid solutions on Kubernetes/OpenShift.
Understanding of microservices and containerized workloads using Kubernetes, Docker, and Helm.
Implement high-throughput compute grid solutions using Apache Ignite, GridGain, Coherence or similar technologies.
Optimize application performance by leveraging caching strategies, load balancing, and efficient data distribution.

Site Reliability Engineering (SRE):
Ensure high availability, scalability, and reliability of distributed systems.
Implement observability, logging, and monitoring using tools like Splunk, Prometheus, Grafana, ELK, or OpenTelemetry.
Automate infrastructure provisioning and deployments using Ansible and Helm charts.
Understanding of CI/CD pipelines for seamless software deployment.
Troubleshoot and resolve incidents affecting the platform, infrastructure, distributed caches, and compute grids, ensuring minimal downtime.

Required Skills & Qualifications:
Strong experience in Kubernetes (OpenShift and on-prem/cloud clusters).
Understanding of programming languages like Java, Go, or Python.
Experience with containerization technologies (Docker, Helm, etc.).
Strong knowledge of CI/CD pipelines (Jenkins, ArgoCD, GitHub Actions).
Hands-on experience with observability tools (Prometheus, Grafana, Loki, Jaeger).
Understanding of networking, service meshes (Istio/Linkerd), and security best practices in Kubernetes.
Experience with multi-cluster and hybrid cloud Kubernetes deployments.
