EverOps

Lead Site Reliability Engineer – IT Support Automation

Overview

As technology organizations scale, so does operational friction. IT support teams become overloaded with repetitive tickets — account lockouts, access requests, provisioning tasks, and standard “ask IT” issues that drain time and attention from higher-value work.

EverOps partners directly with enterprise engineering and IT organizations to solve complex operational challenges from within their environments. We don’t patch symptoms — we eliminate root causes.

We are seeking a Lead Site Reliability Engineer to own and execute a comprehensive IT support automation strategy designed to significantly reduce ticket volume and human intervention.

The Challenge

This is not a reactive support role.

This is a systems-level engineering role focused on:

  • Eliminating tickets before they are created

  • Automating resolution paths when tickets do occur

  • Building durable automation frameworks across SaaS and internal platforms

  • Removing systemic friction across the IT lifecycle

You will operate heavily within the IT support domain, addressing areas such as:

  • Account lockouts and access management

  • Provisioning and deprovisioning workflows

  • Device and asset lifecycle management

  • Standard internal IT requests

  • SaaS integrations and workflow orchestration

The expectation is leadership-level ownership. You will define the automation roadmap, architect solutions, and drive initiatives from intake through deployment with measurable outcomes.

The Mission

As a Lead SRE, your mission is to:

  • Reduce human intervention across IT support workflows

  • Build automation systems that scale without increasing headcount

  • Architect reliable, observable, production-grade automation services

  • Establish engineering standards for automation development

  • Mentor junior engineers while maintaining direct ownership of delivery

Success is measured in outcomes:

  • Reduced ticket creation rates

  • Increased fully automated resolution percentages

  • Improved user satisfaction while lowering operational burden

This role requires deep technical capability combined with strong execution discipline and cross-functional influence.

What You’ll Do

1. Root-Cause Ticket Elimination

  • Analyze ticket trends and identify systemic failure patterns

  • Redesign workflows to remove recurring pain points

  • Replace reactive fixes with preventative engineering solutions

  • Partner with IT and engineering stakeholders to prioritize high-leverage automation opportunities

2. End-to-End Automation Architecture

  • Design and implement automation workflows across multiple SaaS platforms

  • Integrate with third-party and internal APIs (e.g., identity providers, collaboration tools, asset systems, ticketing platforms)

  • Architect resilient API integrations including:

    • Authentication & authorization flows (OAuth2, SAML, token management)

    • Rate limiting and retry strategies

    • Error handling and observability

  • Build self-service systems that allow users to resolve common requests without human escalation
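
To make the "rate limiting and retry strategies" item above concrete, here is a minimal sketch of one common approach: retrying a rate-limited API call with capped exponential backoff and full jitter. The names `TransientError`, `backoff_delay`, and `call_with_retries`, along with the default values, are hypothetical and chosen for illustration only; they do not describe any existing EverOps or client system.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. HTTP 429 or 503)."""


def backoff_delay(attempt, base=0.5, cap=30.0):
    """Upper bound of the wait for a 0-indexed attempt: base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def call_with_retries(func, max_attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call func(); on TransientError, wait with full jitter and try again."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Full jitter: wait a random time between 0 and the capped bound.
            sleep(random.uniform(0, backoff_delay(attempt, base, cap)))
```

In this sketch, the caller wraps whatever client call it makes so that throttling responses raise `TransientError`. Passing `sleep` as a parameter keeps the function easy to test without real delays.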

3. Custom Service & Tooling Development

When no off-the-shelf solution exists, you will:

  • Build lightweight microservices or serverless functions (Python or Go preferred)

  • Develop internal middleware, proxies, or orchestration services

  • Create background automation jobs (cron-style processes)

  • Containerize and deploy services using modern DevOps practices

You will make thoughtful build-vs-buy decisions, balancing speed, maintainability, and long-term scalability.

4. Reliability, Observability & Production Standards

Automation must be as reliable as any production system.

You will:

  • Implement Infrastructure as Code (Terraform, Pulumi, or similar)

  • Maintain CI/CD pipelines for automation services

  • Design monitoring, logging, and alerting frameworks

  • Define SLIs/SLOs to measure automation reliability

  • Ensure automation services are secure, observable, and resilient

This is not scripting — this is platform-grade engineering.
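
As an illustration of what defining SLIs/SLOs for automation might involve, the sketch below computes a success-rate SLI for automation runs and the share of error budget remaining against a target. The function names and the 99% target are assumptions made for this example, not stated requirements of the role.

```python
def sli(successes, total):
    """Fraction of automation runs that succeeded; 1.0 when there were no runs."""
    return 1.0 if total == 0 else successes / total


def error_budget_remaining(successes, total, slo_target=0.99):
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    allowed_failures = (1 - slo_target) * total
    if allowed_failures == 0:
        # No failures are permitted: budget is intact only if nothing failed.
        return 1.0 if successes == total else 0.0
    actual_failures = total - successes
    return 1 - actual_failures / allowed_failures
```

For example, with 1,000 runs and a 99% target, 5 failed runs would leave roughly half of the error budget unspent.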

5. Lead-Level Ownership & Execution

This role requires operating as a single-threaded owner for major initiatives.

You will:

  • Define solution architecture from concept to deployment

  • Set timelines and milestones autonomously

  • Conduct feasibility validation in development environments

  • Communicate proactively with stakeholders

  • Re-scope tactically to maintain forward momentum when blocked

  • Deliver measurable impact — not just activity

You are expected to think systemically, move with urgency, and drive initiatives to completion without requiring micromanagement.

You Have

Experience

  • 8+ years in SRE, Platform Engineering, DevOps, or Automation Engineering

  • Proven experience designing enterprise-scale automation systems

  • Strong exposure to IT support domains (access, provisioning, identity, device lifecycle, SaaS operations)

Technical Strength

API & Integration Expertise

  • Deep experience designing and consuming REST APIs

  • Strong understanding of authentication and authorization patterns

  • Experience orchestrating workflows across multiple SaaS platforms

Programming & Automation

  • Strong proficiency in Python or Go

  • Experience building production-ready services

  • Advanced scripting for orchestration and automation logic

Cloud & Infrastructure

  • Strong familiarity with at least one major cloud provider (AWS, GCP, or Azure)

  • Containerization and Kubernetes exposure

  • Infrastructure as Code experience

Systems Thinking

  • Networking fundamentals

  • Identity and access concepts

  • Understanding of asset lifecycle management

Leadership & Communication

  • Experience leading technical initiatives from idea through deployment

  • Ability to mentor junior engineers

  • Strong written and verbal communication skills

  • Comfortable influencing cross-functional stakeholders

  • Data-driven decision-making approach

You think in terms of leverage, scale, and long-term impact.

What Success Looks Like

Within 6–12 months, you will have:

  • Eliminated entire categories of recurring IT tickets

  • Implemented durable automation frameworks across core IT workflows

  • Increased automated resolution rates quarter over quarter

  • Reduced manual provisioning and access overhead

  • Established scalable, observable automation systems that continue to compound value

Your impact will be visible in metrics — not anecdotes.

Nice to Have

  • Experience integrating AI/LLM capabilities into workflow automation

  • Familiarity with ITSM frameworks

  • Background building internal self-service platforms

  • Experience presenting technical strategy to senior leadership

  • Experience operating in high-scale, compliance-sensitive environments

Benefits

  • 100% Remote Workplace

  • Unlimited Paid Time Off

  • Equity – Become a true owner of the company

  • 401(k) with company contribution and sponsored healthcare

  • Professional Growth – Access to training and certification programs