Tech Lead Manager, Site Reliability Engineer, Product - USDS
Company: TikTok
Location: San Jose, CA 95123
Description:
Responsibilities
The USDS TikTok Product Engineering SRE team works with engineering and product teams to build, maintain and run large-scale, globally distributed, observable, fault-tolerant systems. SREs on this team will deliver on production ownership and be responsible for observability and automation across complex, large-scale service mesh architectures.
To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their manager/department. We regularly review our hybrid work model, and the specific requirements may change at any time.
Responsibilities:
- Provide technical leadership and mentorship to a team of Site Reliability Engineers focused on building observable, fault-tolerant systems
- Drive architectural decisions for large-scale, globally distributed service mesh architectures
- Establish and maintain production ownership models, incident response protocols, and service level objectives
- Develop strategic roadmaps for observability and automation initiatives that enhance system reliability
- Balance technical contributions with people management responsibilities, including career development, performance evaluations, and team growth
- Foster a culture of reliability, continuous improvement, and knowledge sharing within your team and across the organization
- Lead security initiatives to safeguard critical assets, partnering with security and compliance teams to implement robust protocols that ensure data protection and regulatory compliance across all services
Qualifications
Minimum Qualifications:
- 5+ years of experience and expertise in designing, analyzing, and troubleshooting large-scale distributed systems, relational databases, caching solutions and web service frameworks
- Previous experience leading a small to mid-size team while maintaining significant "hands-on" technical contributions
- Strong understanding of Unix/Linux operating system internals and networking fundamentals
- Proficiency in writing production-grade code in Go, Python, Java or similar languages
- Proven track record of establishing and implementing SRE best practices across engineering organizations
- Experience developing and maintaining service level objectives (SLOs) and error budgets
Preferred Qualifications:
- Deep expertise in algorithms, data structures, and systems design with proven ability to architect complex technical solutions
- Track record of developing sophisticated automation tools and developer-friendly APIs that streamline operations and eliminate toil
- Exceptional analytical mindset with demonstrated success solving intricate technical problems across distributed systems
- Extensive experience running high-availability web services at massive scale, with comprehensive knowledge of cloud-native architectures and advanced networking concepts
- Proven ability to lead and collaborate effectively with globally distributed engineering teams across multiple time zones and cultural contexts
- Strategic vision to balance immediate operational needs with long-term reliability and scalability objectives
- Success in designing and implementing observability solutions for complex distributed systems