Site Reliability Engineering (SRE)

Job Description

  • Contractor
  • Anywhere

Contract, Toronto, ON. Please send resumes to Amarjeet.Kumar@akkodisgroup.com

Role: Site Reliability Engineering (SRE)
Location: Toronto, ON
Job Type: Contract

Role Description
•      Monitor and maintain system reliability using tools such as Datadog, VictorOps, ELK, Grafana, and Prometheus.
•      Ensure uptime and performance by proactively identifying issues and responding to alerts.
•      Troubleshoot, investigate, and resolve complex technical issues, collaborating with the engineering team when needed for timely resolution.
•      Handle production incidents by analyzing root causes, prioritizing resolution, and escalating as needed, while meeting defined SLAs and SLOs and tracking the associated SLIs.
•      Develop and implement automation scripts (Python or other scripting languages) to streamline operational tasks, improve system efficiency, and reduce manual workload.
•      Manage and maintain infrastructure across AWS environments.
•      Implement best practices to ensure optimal performance, reliability, and security of cloud-based applications.
•      Work closely with development, QA, and operations teams to drive continuous improvement and foster a culture of reliability.
•      Manage requests and incidents through JIRA and ServiceNow, documenting troubleshooting procedures, solutions, and lessons learned for continuous improvement.
•      Assigned engineers must use customer-provided hardware when carrying out SRE activities.