Identify, craft, and maintain SLIs and SLOs for teams, along with metrics such as MTTR, Lead Time for Changes, Deployment Frequency, and Change Failure Rate
Work with application teams to set up observability and telemetry
Define what it means for a service to be available, then develop, monitor, and alert on SLIs/SLOs
Define, track, and enforce error budgets
Review code instrumentation with development teams and ensure the necessary dashboards are created to monitor SLIs, SLOs, and SLAs
Establish, test, and tune alerting for varying tiers of applications
Document and maintain runbooks and procedures, automating as much as possible
Plan and execute periodic Disaster Recovery exercises including both tabletop and simulated failures (fault injection)
Perform periodic load and scalability testing to establish baselines, detect drift, and inform capacity planning
Design and implement peak readiness reviews for anticipated high-volume times
Participate in quarterly business and operational reviews, aligning on roadmaps, development velocity, efficiency, and growth trends
Requirements:
5+ years of SRE or Systems Engineering experience
Experience with any SRE tool (Grafana, Dynatrace, or Splunk preferred)
Experience with distributed tracing
Experience establishing hooks in CI/CD pipelines to catch SRE violations in lower environments
Soft Skills:
Ability to work independently and as part of a team
Strong analytical and problem-solving mindset combined with experience troubleshooting under pressure
Strategic thinking, complex problem-solving, and analytical capabilities
Strong organizational and interpersonal skills, with experience developing and instilling a culture of operational maturity