Alibaba Cloud-Site Reliability Engineer-Database-NoSQL platform-Sunnyvale

Company: Alibaba Cloud

Location: Sunnyvale, CA 94087

Job Description

We are the NoSQL team of Alibaba Cloud Intelligent Database Division, responsible for the research and development of NoSQL database products, with breakthroughs and implementations in multiple directions such as cloud-native, Serverless, and software-hardware integration. We provide critical foundational data services that help industries such as finance, government and enterprise, gaming, and education migrate to the cloud.

In terms of product technology, we continuously pursue industry-leading techniques, aiming to support emerging large-scale application scenarios such as intelligent search on NoSQL databases. We pursue advancements in new architectures, hardware, and concepts for NoSQL databases, and have accumulated leading technical expertise in areas including database storage engines, distributed processing, Serverless, GPU hardware acceleration, and novel storage media such as NVM. We have published numerous papers at top-tier industry conferences.

Currently, we serve a wide range of customers across different industries on Alibaba Cloud, addressing various technical challenges. Here, you will encounter peak moments with hundreds of millions of accesses, complex and intertwined scenario demands from diverse businesses, operational support for clusters of tens of thousands of servers, and challenges related to global business expansion. Here, you can continuously solve these technical challenges, address pain points, and make optimizations to the system. Here, you can delve into emerging NoSQL databases and related technologies, and contribute to their enhancement. Our goal is to build and refine distributed caching and KV database products that support ultra-high traffic, stability with low latency, Serverless elasticity, and ease of operation.

We are looking for a Site Reliability Engineer (SRE) specialized in the database domain to support the stable operation of Alibaba Cloud's NoSQL platform. This role combines software and systems engineering to keep the platform running reliably and to provide stable NoSQL database services to customers. Responsibilities include but are not limited to:
Proactive Inspection and Risk Prevention: Perform health checks on components within the database foundational platform, develop maintenance tools for routine inspections, and identify and resolve potential risks in advance.
Development of Operations Platforms and Tools: Design and implement automated operations platforms that can maintain large-scale distributed systems. Monitor and maintain various operational metrics, optimizing the system through data analysis. Participate in solving capacity, performance, and stability issues in production systems.
Ensuring System Stability and High Availability: Design and implement high-availability systems, such as automatic fault localization, automatic recovery, adaptive disaster recovery, and implementation of cloud-native technologies, to ensure continuous business availability.
Incident Handling and Emergency Response: During major events like promotional sales, ensure a smooth user experience under massive peak loads while maintaining cost control. Handle production issues, including fault diagnosis, disaster recovery, intelligent scheduling, elastic scaling, and attack mitigation.
Close Collaboration with Development Teams: Work closely with product teams to promptly identify and implement optimizations to technical architectures, improving service latency, performance, and availability. Actively participate in the discussion and design of business solutions, promoting optimization and improvement of services.

Position Requirements
Bachelor's degree in Computer Science, or a related technical field, or equivalent practical experience.
4+ years of work experience as a Site Reliability Engineer in the domain of databases or other cloud products.
Familiar with the basic principles of the Linux kernel and with common tools and commands, with strong diagnostic and optimization skills.
Proficient in at least one of the following languages: Java, Python, Go, C++, with experience in developing operations and maintenance tools.
Familiar with open-source cloud platforms such as Kubernetes, OpenStack, and CloudFoundry.
Knowledge of the principles of, or operational experience with, relational databases such as MySQL, SQL Server, and PostgreSQL, or open-source databases and queue products such as Redis, MongoDB, HBase, Cassandra, Kafka, and Elasticsearch, is a plus.
Experience operating large-scale distributed systems, with proficiency in at least one major cloud platform.
Excellent problem-solving and analytical skills.

The pay range for this position at commencement of employment is expected to be between $133,200/year and $219,600/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.

If hired, the employee will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
