Alibaba Cloud-Cloud Native/Middleware Reliability Engineer (SRE)-Middleware-Sunnyvale

Company: Alibaba Cloud

Location: Sunnyvale, CA 94087

Job Description

The Alibaba Cloud Cloud-Native Middleware team is responsible for the research and development of distributed software infrastructure and is committed to delivering outstanding API Gateway and microservices solutions to tens of thousands of enterprise customers on Alibaba Cloud, accelerating their cloud migration processes and innovation velocity.

Cloud Product Operations & Reliability
Oversee stability maintenance, performance tuning, and high-availability architecture design for microservices middleware (ZooKeeper/Nacos). Ensure 24/7 reliability of mission-critical systems.
Manage the containerized middleware lifecycle on Kubernetes clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in K8s environments.

Incident Response & Root Cause Analysis
Lead troubleshooting of middleware-related incidents (e.g., message backlog, service registration failures) through log analysis, distributed tracing, and monitoring systems.
Develop diagnostic tools using Java/Go to resolve production issues, performance bottlenecks, and compatibility challenges.

Automation & Operational Excellence
Build Python/Go/Shell automation tools to standardize middleware deployment, monitoring, and disaster recovery workflows.
Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience.

Collaboration & Best Practices
Partner with teams to optimize cloud product adoption strategies and deliver architecture design consultation.
Create comprehensive technical documentation and drive standardization of middleware operations.

Position Requirements

Minimum qualifications:
Bachelor's degree or higher in Computer Science, with 3+ years of experience in SRE or middleware operations.
Deep understanding of SRE principles: balancing reliability metrics (SLIs/SLOs) with engineering velocity.
Proven ability to diagnose complex distributed system failures under pressure.
Excellent communication skills to drive cross-team collaboration and technical documentation.

Preferred qualifications:
Experience modifying middleware source code for performance optimization.
Expertise in large-scale distributed systems (10k+ topics, 1k+ node clusters).
Kubernetes certifications (CKA/CKAD) or cloud provider certifications.

The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.

If hired, employee will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
