Alibaba Cloud-Site Reliability Engineer-Database-Foundational Platform-Sunnyvale

Company: Alibaba Cloud

Location: Sunnyvale, CA 94087

Job Description

The DBaaS Pluggable Service (DPS) Team at Alibaba Cloud is dedicated to ensuring the stability and evolution of database control plane services. We manage critical components such as database network topology, internal microservices infrastructure, and metadata services. We also handle data plane vertical services such as high-availability guarantees, monitoring, and billing for database products.

Our mission is to optimize these systems for performance, efficiency, and reliability, supporting both internal teams and external customers. Join us to innovate in the heart of Alibaba Cloud's database ecosystem.

1. Platform Stability & High Availability
Conduct health checks, risk assessments, and preventive maintenance for database platform components.
Design and implement HA solutions (e.g., automated fault recovery, adaptive disaster resilience) and adopt cloud-native technologies.
Optimize network architecture and Kubernetes (K8s) cluster operations for database services.

2. Operational Tooling & Automation
Develop platforms/tools for large-scale distributed systems management, including automated deployment, monitoring, and diagnostics.
Enhance observability through metrics, logging, tracing, and alerting systems (e.g., Prometheus, Grafana, OpenTelemetry).

3. Incident Management & Optimization
Resolve live-site issues, including performance bottlenecks, capacity scaling, and security threats.
Collaborate with product teams to refine architectures, reduce latency, and improve availability.

4. Cross-Functional Collaboration
Drive standardization of control-plane components (e.g., microservice frameworks, metadata services) across database engines.

Position Requirements

Minimum qualifications:

- Bachelor's degree in Computer Science, Engineering, or related field, or equivalent experience.

- 4+ years of experience in SRE/DevOps roles, preferably with cloud databases or distributed systems.

- Technical Expertise:
Knowledge of database theory (ACID, CAP, replication, sharding) and hands-on experience with MySQL, PostgreSQL, Redis, or similar.
Proficiency in Kubernetes operations, Linux systems, and network fundamentals (TCP/IP, DNS, load balancing).
Development skills in Python/Go/Java for automation tooling.
Familiarity with observability tools (e.g., Prometheus, ELK stack) and microservice frameworks (e.g., Spring Cloud, gRPC).
Experience with major cloud platforms (AWS/Azure/GCP/Alibaba Cloud).

Preferred qualifications:

- Certifications: CKA/CKAD, AWS/Azure certifications, or database-specific credentials (e.g., MongoDB, Redis).

- Knowledge of database HA solutions (e.g., Pacemaker, Patroni) and backup/recovery mechanisms.

The pay range for this position at commencement of employment is expected to be between $133,200/year and $219,600/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.

If hired, the employee will be in an "at-will" position, and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
