Site Reliability Engineer, Cloud Efficiency (SRE / Senior)
At Okta our motto is "Always On", and nowhere do we embrace that more than in Technical Operations. We strive to build the most reliable and performant systems on the planet through the skillful use of automation.
If you like to be challenged and have a passion for solving problems at scale with automation, testing and tuning, then we would love to hear from you.
For this role we are also looking for someone to begin their SRE career at Okta working on automation in the area of cloud-resource tracking, tagging and optimization.
The ideal candidate is someone who exemplifies the ethic of “If you have to do something more than once, automate it” and who can rapidly self-educate on new concepts and tools.
You will work on:
- Identifying and automating manual processes
- Designing, building and deploying Okta's production infrastructure with an initial focus on cloud-resource tracking and monitoring
- Promoting and applying best practices for building scalable and reliable services across engineering
- Developing and maintaining technical documentation, runbooks and procedures
- Potentially growing into a full SRE role in the future, supporting a 24x7 online environment as part of an on-call rotation
You are an ideal candidate if you:
- Have strong Linux and networking fundamentals
- Have experience automating and deploying large-scale production services in AWS (EC2, ECS, Lambda, IAM, KMS, Kinesis, RDS)
- Prefer scripting for operational tooling in Bash, Ruby, Python, Go or similar, and have experience scripting against cloud APIs such as GCP SDK or AWS CLI / API Gateway
- Have expertise in tracking and monitoring cloud resources, including resource tagging, logging and reporting
- Have experience with monitoring tools such as Splunk, Wavefront or ELK
- Have experience with BI/analytics reporting tools such as Tableau (a plus)
Education and Training:
- B.S. in Computer Science (a plus) or relevant experience, 4-6 years