Lead Site Reliability Engineer (San Jose, CA) (Remote Eligible)

San Jose, CA Tech Ops-610

We are looking for an experienced Lead Site Reliability Engineer to join our Technical Operations team. At Okta, we are "Always On." That commitment starts with this team, which ensures that customers never have to worry about the Okta service and strives to build the most reliable and performant systems on the planet.

We are looking for a smart, innovative, and passionate engineer for this role: someone who has a passion for designing complex network infrastructure on a cloud platform. The ideal candidate welcomes a challenge and enjoys seeing their designs run at scale with automation, testing, and tuning. If you exemplify the ethic of "If you have to do something more than once, automate it," we want to hear from you!

Due to federal data handling requirements, candidates must be US citizens.

What You'll Do:

  • Lead a team that designs and builds Okta's production infrastructure with a focus on networking and security at scale
  • Promote and apply best practices for building scalable and reliable network services across engineering
  • Be a subject matter expert and partner with our team at Amazon Web Services (AWS)
  • Develop and maintain technical documentation, network diagrams, runbooks, and procedures
  • Design, build, run, and monitor Okta's production infrastructure
  • Drive initiatives to evolve our current platform to increase efficiency and keep it in line with current standards and best practices
  • Respond to production incidents and determine how we can prevent them in the future
  • Identify and automate manual processes
  • Support a 24x7 online environment as part of an on-call rotation

Qualifications for the role:

  • Have a track record of leading successful SRE/DevOps projects
  • 8+ years of experience with designing large scale solutions
  • 3+ years of experience architecting complex network-based applications on AWS (VPC, ALB/NLB, EC2, IAM, KMS)
  • Possess in-depth knowledge of network design, software firewalls, load balancers, and session management
  • Demonstrate strong Linux fundamentals
  • Have exposure to FedRAMP, SOC2 or other compliance programs 
  • 3+ years of experience with automating systems and infrastructure via Ansible, Chef or Terraform
  • Have experience automating and running large scale production services in AWS or other cloud providers
  • Code to a good standard in any programming language, especially Ruby, Python, or Go, using source control and Agile methodologies
  • Have excellent written and oral communication skills, with the ability to influence others

Education and Training:

  • BS in Computer Science (a plus) or relevant experience

Okta is rethinking the traditional work environment, providing our employees with the flexibility to be their most creative and successful versions of themselves, no matter where they are located. We enable a flexible approach to work, meaning you can work from the office or home, regardless of where you live. Okta invests in the best technologies and provides flexible benefits and collaborative work environments and experiences, empowering employees to work productively in a setting that best and uniquely suits their needs. Find your place at Okta: https://www.okta.com/company/careers/

Okta is an equal opportunity employer.



