Job Description
Contract // Toronto, ON. Please send resumes to Amarjeet.Kumar@akkodisgroup.com
Role: Site Reliability Engineering (SRE)
Location: Toronto, ON
Job Type: Contract
Role Description
• Monitor and maintain system reliability using tools like DataDog, VictorOps, ELK, Grafana, and Prometheus.
• Ensure uptime and performance by proactively identifying issues and responding to alerts.
• Troubleshoot, investigate, and resolve complex technical issues, collaborating with the engineering team where needed for timely resolution.
• Handle production incidents by analyzing root causes, prioritizing resolution, escalating as needed, and meeting defined SLAs and SLOs as tracked through SLIs.
• Develop and implement automation scripts (Python or other scripting languages) to streamline operational tasks, improve system efficiency, and reduce manual workload.
• Manage and maintain infrastructure across AWS environments.
• Implement best practices to ensure optimal performance, reliability, and security of cloud-based applications.
• Work closely with development, QA, and operations teams to drive continuous improvement and foster a culture of reliability.
• Manage requests and incidents through JIRA and ServiceNow, and document troubleshooting procedures, solutions, and lessons learned for continuous improvement.
• Assigned engineers must use customer-provided hardware when carrying out SRE activities.