Site Reliability Engineering has an exciting and challenging mission: build, deploy, operate, scale, and maintain company-wide platforms (PlaaS) for customer-facing Adobe SaaS solutions. While various development groups focus on building our platforms, SRE provides operational and engineering support for both the platforms and the product teams that leverage them. A capable site reliability engineer (SRE) should have one main high-level objective: identify and solve complex problems through software. This is not a traditional sysadmin/operations role (e.g., deployments, ticket work, dashboarding, monitoring, incident response). A significant portion of your time (~50%) will be spent on some form of programming/development work, preferably to solve self-identified problems. This role will work with the various Adobe product engineering teams and will report to the Engineering Manager of the Site Reliability Engineering group.
Areas of Responsibility:
- Ensure the highest level of uptime and Quality of Service (QoS) to Adobe’s customers through operational excellence
- Define service level objectives (SLOs) and service level indicators (SLIs) to represent and measure service quality
- Embed with product teams (physically and/or virtually) to foster strong collaboration/partnership
- Identify areas to improve service resiliency through techniques such as chaos engineering and performance/load testing
- Support and maintain globally distributed, multi-cloud (public and/or private) environments
- Automate common, repeatable tasks at large scale to streamline operational procedures
- Design and maintain production monitoring systems
- Troubleshoot performance and stability issues using a wide variety of tools
- Evaluate and manage application and environment security
- Follow change management processes during implementations
- Use and maintain version control for application infrastructure
- Work in a diverse and global team environment
- Cross-train with other global team members
- Participate in an on-call rotation as required
- Determine the root cause of all production-level incidents and write corresponding high-quality root cause analysis (RCA) reports
- Promote the DevOps/SRE mindset
What you will bring:
- Experience with distributed applications at scale in public cloud (AWS and/or Azure)
- Experience in one (and preferably more) of the following languages: C, C++, Java, Python, Go, Perl, or Ruby
- Expertise with container orchestration engines (e.g., Kubernetes, Mesos)
- Working knowledge of modern, continuous development techniques and pipelines (Agile, Kanban, CI/CD, Jenkins, Git, Artifactory)
- Experience working within software development or Internet-related industries, particularly in the context of a SaaS offering
- B.S. degree in Computer Science or related technical field