DevOps Engineer
LiteLLM is the world’s most popular AI Gateway, used by some of the largest companies (Adobe, Netflix, NASA, etc.) to give their developers access to LLMs and adjacent services (MCP servers, vector stores, etc.).
Why do companies use LiteLLM Enterprise?
Companies adopt LiteLLM Enterprise once they run LiteLLM in production and need enterprise features such as Prometheus metrics (production monitoring), or need to give LLM access to a large number of people through SSO (single sign-on) or JWT (JSON Web Token) authentication.
What you will be working on
We are hiring an exceptional engineer to own release infrastructure and release security at LiteLLM. This is an opportunity to join us in person as an early employee and make a large impact at a high-growth startup. You will own a critical part of the company, with a high degree of autonomy: making sure we can ship secure, reliable releases on a consistent cadence.
We work 6 days per week in our SF office, approximately 60 hours per week in total.
We are looking for a software engineer with a strong background in infrastructure, CI/CD, and release engineering. You should be comfortable working across Helm, Terraform, release automation, testing systems, and the developer infrastructure needed to guarantee stable releases. This is a hands-on role.
You should be able to investigate test failures, distinguish real regressions from flaky tests, write Python, fix minor test issues, remove dead tests, and improve the overall reliability of the release pipeline. You should also be able to architect a secure end-to-end release process: how code moves from commit to published artifact, how access is controlled, how secrets are handled, and how we reduce the chance of bad or unauthorized releases.
What you will do
Own secure, regular releases for LiteLLM, including 2 nightly releases and 1 stable release per week.
Manage and improve the infrastructure behind our release process, including Helm, Terraform, CI/CD, and other developer systems needed to keep releases stable.
Investigate test failures and determine whether they are true regressions, flaky tests, or dead tests that should be fixed or removed.
Write Python to fix minor test issues, improve release reliability, and support developer workflows.
Architect and implement a secure release process across build, test, approval, and publish steps.
Work closely with the engineering team to improve release quality, reduce operational risk, and keep shipping velocity high.
What we're looking for
2+ years of experience in infrastructure engineering, DevSecOps, release engineering, or related systems work.
Proficient in Python and comfortable making code changes in test and release systems.
Experience with Terraform, Helm, CI/CD systems, and cloud infrastructure.
Strong judgment around release reliability, testing, and debugging.
Ability to distinguish between real regressions and flaky infrastructure or test behavior.
Ability to design secure release processes, including access controls, secrets handling, and safe publishing workflows.
Ability to collaborate effectively with engineers across product, infra, and security.
About LiteLLM
LiteLLM (https://github.com/BerriAI/litellm) is a Python SDK and Proxy Server (LLM Gateway) for calling 100+ LLM APIs in the OpenAI format (including Bedrock, Azure, OpenAI, VertexAI, and Cohere). It is used by companies like Rocket Money, Adobe, Twilio, and Siemens.