Okta is seeking a Site Reliability Engineering (SRE) Manager to lead our Core SRE team.
At Okta our motto is "Always On", and nowhere do we embrace that more than in Technical Operations. We strive to build the most reliable and performant systems on the planet through the skillful use of automation. We've created an integrated system that securely connects any person via any device to the technologies they need to do their most significant work.
The Core SRE team is at the center of our growing production services at Okta. Your team works directly with TPM/QA and Engineering to automate AWS services across the world. The team also leads our edge networking services and plays a key role in a number of new projects.
The ideal candidate:
- Has a track record of leading or managing high-performing teams whilst still being hands-on
- Has production experience with AWS cloud-based infrastructure
- Has operated complex custom applications on UNIX/Linux and/or Enterprise Java platforms
- Is passionate about automation and leveraging agile software development methodologies to deliver automation
Job Duties and Responsibilities:
- Mentor and manage a team of experienced engineers using agile development
- Partner with recruiting to hire staff in our HQ and remote sites
- Manage and own delivery of new infrastructure components:
  - Collaborate with TPM, architects and executive management
  - Design and code reviews
  - Partner with Okta security teams
- Continuously refine monitoring processes, thresholds, and configuration
- Respond to issues and escalations and participate in a management on-call rotation
- Work closely with product developers to ensure new features have the proper operational support and maintainability
Minimum REQUIRED Knowledge, Skills, and Abilities:
- Demonstrated track record of leading or managing a team
- Experience with Amazon Web Services and knowledge of AWS networking technologies (VPC/ELB/WAF)
- Experience managing Linux systems in production
- Proficient in at least one scripting language (bash, Perl, Ruby, Python)
- Experience supporting a complex, multi-tier service running in the cloud
- Prior experience in a software development, DevOps, or SRE role