Job Description
Role: Digital-Azure SRE
Location: Toronto / Hybrid
Email: rahul.n@epsilonsolutions.ca
We are looking for a Site Reliability Engineer to join the Digital Line of Business for our client. The ideal candidate has experience with scripting and the Microsoft Azure cloud platform.
Responsibilities:
**Monitoring and Alerting:**
Implement and maintain monitoring systems to proactively identify potential issues and alert engineers to problems before they impact users.
**Incident Response:**
Respond to incidents and outages, diagnose problems, and implement solutions to minimize downtime and restore service.
**Automation:**
Automate repetitive tasks and processes to improve efficiency and reduce manual effort.
**Performance Optimization:**
Identify and address performance bottlenecks to ensure systems run efficiently and effectively.
**Infrastructure Management:**
Manage and maintain the underlying infrastructure, including servers, networks, and cloud resources.
**Capacity Planning:**
Plan for future capacity needs to ensure systems can handle anticipated workloads.
**Release Engineering:**
Develop and maintain processes for deploying software updates and releases.
**Collaboration:**
Work closely with developers, operations teams, and other stakeholders to ensure system reliability and availability.
**Documentation:**
Maintain clear and concise documentation of systems, processes, and procedures.
**Continuous Improvement:**
Identify areas for improvement and implement changes to enhance system reliability and performance.
**Skills and Qualifications:**
**Programming Skills**: Proficiency in scripting languages (e.g., Python, Bash) and experience with programming languages (e.g., Java, Go).
**Operating Systems**: Knowledge of Linux and Windows server administration.
**Networking**: Understanding of network protocols and infrastructure.
**Cloud Computing**: Experience with cloud platforms (e.g., AWS, Azure, GCP).
**Database Management**: Familiarity with relational and NoSQL databases.
**Monitoring Tools**: Experience with monitoring tools (e.g., Prometheus, Grafana, Splunk).
**Automation Tools**: Experience with automation tools (e.g., Ansible, Terraform, Docker).
**Problem-Solving**: Strong analytical and problem-solving skills.
**Communication**: Excellent communication and collaboration skills.
**Incident Management**: Experience with incident response and management.
**Change Management**: Experience with change management processes.
**DevOps**: Understanding of DevOps principles and practices.