Alibaba Cloud-API-Platform SRE-Sunnyvale
Company: Alibaba Cloud
Location: Sunnyvale, CA 94087
Job Description
The Alibaba Cloud Open Platform team is responsible for enterprise-level cloud capabilities such as API Platform, and enterprise solutions such as Landing Zone and the Well-Architected Framework.
Description
Maintaining system reliability and ensuring core system availability are critical for Open Platform. The goal of this role is to establish a system reliability framework that combines technology and management, including but not limited to the following:
1. Develop reliability standards and metrics that cover aspects such as robust architecture design, engineering quality, release management, and production environment operations, ensuring reliability is integrated into the full Alibaba Cloud development lifecycle.
2. Drive major reliability governance initiatives, such as full-stack disaster recovery, gradual rollout, incident response and mitigation (1-5-10), and loss prevention, to quickly mitigate reliability risks.
3. Build a reliability platform that supports change automation, red team/blue team exercises, incident response collaboration, risk scanning, and monitoring, to simplify reliability engineering.
4. Handle production environment incidents, including incident response, incident coordination, incident detection, incident recovery, and postmortem analysis.
5. Provide technical support to ensure customer business continuity.
Responsibilities
Perform daily maintenance of applications, databases, and middleware; troubleshoot issues and address customer inquiries.
Collaborate with cloud product teams to develop business-critical reliability and on-call plans based on customer requirements for key business periods.
Participate in the technical design and implementation of business platforms, identifying bottlenecks and proposing solutions.
Build high-quality, reusable infrastructure, and improve product quality and engineering efficiency.
Stay updated on cutting-edge technologies, and leverage them in the team's services and infrastructure.
Position Requirements
Basic Qualifications
Bachelor's Degree in Computer Science, Information Systems, Computer Engineering or a related field.
5+ years of Systems Engineering, DevOps, Site Reliability Engineering (SRE) or Enterprise Production experience. Understand and follow SRE/DevOps best practices.
3+ years of experience operating in a 24/7 production environment. Proficient with SRE tooling, including at least one scripting language, monitoring tools, and IaC tools. Experienced in troubleshooting large-scale distributed systems.
Good team player, able to influence the team and improve team productivity and team morale.
Good communication skills, proficient in Chinese.
Preferred Qualifications
3+ years of experience with cloud computing technologies, with in-depth understanding of and/or hands-on experience with at least one of the major cloud areas: Compute/Storage/Network/Database/IAM.
SRE experience at other major cloud providers (e.g., AWS/GCP/Azure).
The pay range for this position at commencement of employment is expected to be between $133,200/year and $219,600/year. However, base pay offered may vary depending on multiple individualized factors, including market location, job-related knowledge, skills, and experience.
If hired, employee will be in an "at-will position" and the Company reserves the right to modify base salary (as well as any other discretionary payment or compensation program) at any time, including for reasons related to individual performance, Company or individual department/team performance, and market factors.