Implement SRE practices
Identify, craft, and maintain SLIs and SLOs for teams, as well as metrics such as MTTR, lead time for changes, deployment frequency, and change failure rate
Work with application teams to set up observability and telemetry
Define what it means for a service to be available and develop, monitor, and alert on SLIs/SLOs
Define, track, and enforce error budgets
Review code instrumentation with development teams and ensure the necessary dashboards are created to monitor SLIs, SLOs, and SLAs
Establish, test, and tune alerting for varying tiers of applications
Document and maintain runbooks and procedures, automating as much as possible
Plan and execute periodic disaster recovery exercises, including both tabletop exercises and simulated failures (fault injection)
8+ years of SRE or systems engineering experience and a total of 12-15 years of software industry experience
Experience with any SRE tool (Grafana, Dynatrace, or Splunk preferred)
Experience with distributed tracing
Experience establishing hooks in CI/CD pipelines to catch SRE violations in lower environments
Strong analytical and problem-solving mindset combined with experience troubleshooting under pressure
Strategic thinking, complex problem-solving, and analytical capabilities
Strong organizational and interpersonal skills, with experience developing and instilling a culture of operational maturity
Ability to adjust quickly to new technologies
Job Type: Fixed term contract
Contract length: 6 months
Schedule:
- 8-hour shift
- Day shift
- Monday to Friday
Experience:
- SRE: 8 years (required)
- Grafana: 5 years (required)
Work Location: Hybrid remote in Halifax, NS B4B 1R6