Alibaba Cloud-Cloud Native/Serverless Reliability Engineer (SRE)-Sunnyvale

Company: Alibaba Cloud

Location: Sunnyvale, CA 94087

Description:

Job Description

The Alibaba Cloud Cloud Native Serverless Team is a leading innovation force within Alibaba Cloud, dedicated to empowering developers and enterprises with cutting-edge serverless technologies. Focused on building scalable, cost-efficient, and fully managed serverless solutions, the team drives the evolution of cloud-native architectures by abstracting infrastructure complexity and enabling seamless integration with modern application development paradigms. The team delivers industry-leading serverless solutions that compete directly with AWS Lambda and the offerings of other global cloud providers.

Cloud Product Operations & Reliability
Oversee stability maintenance, performance tuning, and high-availability architecture design for serverless system components. Ensure 24/7 reliability of mission-critical systems.
Manage the container lifecycle on serverless clusters: implement deployments, auto-scaling, version upgrades, and resource optimization in serverless environments.

Incident Response & Root Cause Analysis
Lead troubleshooting of incidents related to serverless, middleware, and cloud products (e.g., key-value storage issues, message backlogs, service registration failures) through log analysis, distributed tracing, and monitoring systems.
Develop diagnostic tools using Go/Rust to resolve production issues, performance bottlenecks, and compatibility challenges.

Automation & Operational Excellence
Build automation tools to standardize serverless system deployment, monitoring, and disaster recovery.
Implement chaos engineering experiments, capacity planning strategies, and failover mechanisms to enhance system resilience.

Collaboration & Best Practices
Partner with other teams to optimize cloud product adoption strategies and provide architecture design consultation.
Create comprehensive technical documentation and drive standardization of serverless operations.

Position Requirements

Minimum qualifications:
Bachelor's degree or higher in Computer Science, with 3+ years of experience in SRE or serverless operations.
Deep understanding of SRE principles, including balancing reliability metrics (SLIs/SLOs) with engineering velocity.
Proven ability to diagnose complex distributed system failures under pressure.
Excellent communication skills to drive cross-team collaboration and technical documentation.

Preferred qualifications:
Experience modifying the source code of cloud-based products for performance optimization; serverless experience is preferred.
Expertise in large-scale distributed systems (10k+ topics, 1k+ node clusters).
Kubernetes certifications (CKA/CKAD) or cloud provider certifications.

The pay range for this position at commencement of employment is expected to be between $104,400 and $171,000/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.

If hired, employee will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
