Site Reliability Engineer - Recommendation Infrastructure
Company: TikTok
Location: San Jose, CA 95123
Description:
Responsibilities
Our Recommendation Infrastructure Team is responsible for building and optimizing the architecture of our recommendation system to give TikTok users the most stable and best possible experience.
SREs on our team keep these systems running with the highest level of availability, and build highly automated systems and pipelines.
What You'll Do
Engage in and improve the whole lifecycle of recommendation systems, from system design consulting through launch reviews, deployment, operation and refinement
Deliver tools/software to improve the reliability and scalability of services, automate operations and improve R&D efficiency
Build high availability into large-scale services deployed across global data centers
Plan, manage and optimize cloud resource utilization while meeting the SLAs of large-scale clusters
Measure and monitor availability, latency and overall service health
Practice sustainable incident response and postmortems
Qualifications
Minimum Qualifications:
Bachelor's degree or higher in Computer Science or a related field
Familiarity with Linux system operations and networking
Experience programming in at least one of the following languages: Python, Perl, Go, or C/C++
Familiarity with common CI/CD practices and environments
Effective communication skills and a sense of ownership and drive
Preferred Qualifications:
SRE experience deploying large-scale systems with high reliability and scalability
Experience in designing, analyzing and troubleshooting large-scale distributed systems