Production Support Engineer / SRE

Job Description

  • Contractor
  • Anywhere

Hiring: Production Support Engineer / SRE
📍 Location: Toronto, ON (4 Days Onsite)

🔹 Role Overview
We are seeking a skilled Production Support Engineer / SRE to support critical digital applications and backend systems. This role blends hands-on production support with Site Reliability Engineering (SRE) practices, focusing on automation, infrastructure reliability, and high availability.

🔹 Key Responsibilities

🛠️ Infrastructure & Toil Reduction
✅ SSL/TLS certificate renewals & security updates
✅ Linux & Windows server patching
✅ Credential rotation & vulnerability remediation
✅ Identify and eliminate repetitive manual tasks

🗄️ Database & Cluster Management
✅ Manage Elasticsearch, MongoDB, and Redis clusters
✅ Performance tuning, scaling, backup & recovery
✅ Disaster recovery & failover execution
✅ Capacity planning & system optimization

⚙️ Automation & SRE
✅ Build & maintain Ansible playbooks
✅ Develop Infrastructure-as-Code solutions
✅ Create runbooks & operational playbooks
✅ Implement monitoring, alerting & auto-remediation

🚨 Production Support
✅ Troubleshoot and resolve critical incidents
✅ Participate in RCA & postmortems
✅ Monitor application performance & uptime
✅ Collaborate with development & infra teams

🔹 Must-Have Skills
✔ 5+ years in Production Support / SRE
✔ Strong experience with:
  • Ansible (2+ years)
  • Elasticsearch & MongoDB (2+ years)
  • Redis, OpenShift, Azure
✔ Linux (RHEL/CentOS/Ubuntu) & Windows Server Admin
✔ Shell scripting (Bash)
✔ Incident management & troubleshooting

🔹 Nice to Have
⭐ Kubernetes / container platforms
⭐ Python or Go scripting
⭐ CI/CD tools (Jenkins, GitLab, Azure DevOps)
⭐ Monitoring tools (Prometheus, Grafana, ELK, Datadog)
⭐ Terraform / CloudFormation
⭐ Certifications (AZ-900, CKA, Elasticsearch)

🔹 Key Requirements
📌 Experience in financial services (preferred)
📌 Strong analytical & problem-solving skills
📌 Ability to work independently & in teams
📌 Excellent documentation & communication

🔹 Operational Expectations
⏱️ On-call rotation (with notice)
⚡ Critical incident response within 30 minutes
📊 Maintain runbooks & knowledge base
🤝 Regular collaboration across teams

📩 Interested candidates can apply or share profiles!