Selector

Senior Site Reliability Engineer

Added 6 days ago

We are seeking a highly skilled Senior Site Reliability Engineer to join our Engineering team in India. This role is a split-duty position comprising both customer-facing responsibilities and internal platform reliability initiatives.

As a Senior SRE, you will play a critical role in deploying, maintaining, and improving the reliability and scalability of Selector’s platform across on-premises and SaaS environments. You will collaborate closely with Platform Engineering, DevOps, and customer teams to ensure seamless deployments, strong system performance, and continuous platform improvement.

Key Responsibilities

  • Serve as a senior technical expert in deploying and maintaining Selector’s operational analytics platform across on-premises and SaaS environments.
  • Lead complex customer installations, including deployments in air-gapped and highly regulated environments.
  • Partner directly with customers via Zoom/Teams to troubleshoot problems, triage services, and resolve installation or performance issues.
  • Author, review, and maintain Infrastructure as Code (IaC) using Terraform/OpenTofu, ensuring scalable and maintainable infrastructure design.
  • Deploy and manage containerized applications using Kubernetes (including RKE) and Kustomize in production environments.
  • Triage and resolve issues across distributed systems, Kafka pipelines, CI/CD workflows (Jenkins), and Google Cloud infrastructure.
  • Provide structured, actionable feedback to Platform Engineering and DevOps teams to improve reliability, scalability, and performance.
  • Participate in and help mature on-call processes, ensuring high availability and operational excellence.
  • Perform root cause analysis for production incidents and implement long-term corrective and preventative solutions.
  • Research, evaluate, and implement new tools or architectural improvements to address infrastructure and operational challenges.
  • Mentor junior engineers and promote SRE best practices across reliability, observability, and automation.
  • Improve internal tooling, automation, and operational workflows to enhance developer productivity and system stability.