Senior Site Reliability Engineer
We are seeking a highly skilled Senior Site Reliability Engineer to join our Engineering team in India. This role is a split-duty position comprising both customer-facing responsibilities and internal platform reliability initiatives.
As a Senior SRE, you will play a critical role in deploying, maintaining, and improving the reliability and scalability of Selector’s platform across on-premises and SaaS environments. You will collaborate closely with Platform Engineering, DevOps, and customer teams to ensure seamless deployments, strong system performance, and continuous platform improvement.
Key Responsibilities
- Serve as a senior technical expert in deploying and maintaining Selector’s operational analytics platform across on-premises and SaaS environments.
- Lead complex customer installations, including deployments in air-gapped and highly regulated environments.
- Partner directly with customers via Zoom/Teams to troubleshoot and triage services and to resolve installation or performance issues.
- Author, review, and maintain Infrastructure as Code (IaC) using Terraform/OpenTofu, ensuring scalable and maintainable infrastructure design.
- Deploy and manage containerized applications using Kubernetes (including RKE) and Kustomize in production environments.
- Triage and resolve issues across distributed systems, Kafka pipelines, CI/CD workflows (Jenkins), and Google Cloud infrastructure.
- Provide structured, actionable feedback to Platform Engineering and DevOps teams to improve reliability, scalability, and performance.
- Participate in and help mature on-call processes, ensuring high availability and operational excellence.
- Perform root cause analysis for production incidents and implement long-term corrective and preventative solutions.
- Research, evaluate, and implement new tools or architectural improvements to address infrastructure and operational challenges.
- Mentor junior engineers and promote SRE best practices across reliability, observability, and automation.
- Improve internal tooling, automation, and operational workflows to enhance developer productivity and system stability.