Alibaba Cloud-Site Reliability Engineering (SRE) Specialist-Seattle
Company: Alibaba Cloud
Location: Seattle, WA 98115
Job Description
Elastic Compute Service (ECS) is a core product of Alibaba Cloud. The Elastic Compute team is dedicated to building world-leading cloud computing infrastructure. As a key component of Alibaba Cloud's self-developed Apsara operating system, ECS provides full-stack computing resources covering virtual machine instances, container services, and heterogeneous computing clusters.
Through technological innovation and product optimization, the Alibaba Cloud Elastic Compute team continuously drives advancements in cloud computing technologies, delivering high-quality computing services to users worldwide. Our goal is not only to support enterprises in achieving elastic scalability but also to empower infrastructure innovation in the new era. Our mission is to build an intelligent foundation of "Computing as a Service," enabling developers to focus on their businesses and concentrate on breakthroughs without worrying about the complex engineering implementations from chips to clusters.
SRE Team:
The Alibaba Cloud Elastic Compute Service (ECS) SRE (Site Reliability Engineering) team is a critical force in ensuring system stability and reliability. The SRE team focuses on guaranteeing the high availability, high performance, and robust stability of ECS products through technical expertise and innovation.
The Alibaba Cloud ECS SRE team is not only a core technical safeguard but also a driver of technological innovation and continuous optimization. By leveraging technical capabilities and collaborative teamwork, we ensure the stability and reliability of ECS products, safeguarding global customers' businesses. Additionally, we are committed to advancing cloud computing technologies through knowledge sharing and industry collaboration.
Joining the Alibaba Cloud ECS SRE team offers the opportunity to engage in the development and optimization of world-leading cloud computing technologies, while growing alongside a passionate and creative team.
Responsibilities
1. Deliver, operate, and maintain various clusters, and participate in the architecture design and construction of the infrastructure operation platform.
2. Establish and optimize operation and maintenance service systems to achieve product stability and SLA goals.
3. Develop delivery standards, document maintenance specifications, and enhance daily work efficiency through tool platforms.
4. Take part in on-call duty: respond to customers within Service Level Agreement (SLA) timeframes, drive issue resolution, and improve customer experience.
Position Requirements
1. 5+ years of operation and maintenance (O&M) experience in IT, internet, or cloud computing industries.
2. Proficient in Linux operating systems and mainstream protocols (e.g., TCP/IP), with solid hands-on experience in troubleshooting OS and network issues.
3. Familiar with containerization, orchestration, and job-scheduling technologies such as Kubernetes, Slurm, and LSF.
4. Ability to analyze and document technical issues systematically, develop tools/systems to optimize workflows, and improve operational efficiency through automation and platform-based solutions.
5. Strong self-driven learning capabilities, excellent communication skills, and experience leading cross-team projects. Results-driven and action-oriented, with a commitment to excellence.
The pay range for this position at commencement of employment is expected to be between $133,200/year and $219,600/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.
If hired, employee will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.
Alibaba U.S.-based full-time regular employees have access to medical, dental, and vision insurance, a 401(k) plan and basic life insurance, and wellbeing benefits like FSA, subject to the terms and conditions of the applicable plans then in effect. U.S.-based employees are also eligible to receive up to 12 paid holidays, accrue up to 15 paid vacation days for this position, and receive up to 72 hours of paid sick time (front-loaded) per calendar year.