Description
Site Reliability Engineer role at Blaxel, focused on the reliability, performance, and scalability of its AI infrastructure platform. The role covers observability, incident response, automation, and infrastructure security across the compute, networking, storage, and sandboxed execution layers. It emphasizes building and operating the core infrastructure that powers the 25ms cold-start compute engine, along with SLOs/SLIs, self-healing automation, and collaboration with platform engineers.
Company
Blaxel offers a persistent sandbox platform that keeps sandboxes in standby to preserve memory and context, letting AI workloads resume instantly. Sandboxes auto-suspend when idle, incurring zero standby compute cost, and resume in about 25ms with full memory and a memory-backed filesystem. The system co-locates agents and data on a high-speed backbone to minimize latency, supports long-term data retention with block-storage volumes, and provides configurable networking (custom domains, dedicated IPs, VPC). The Blaxel SDK enables spawning thousands of sandboxes for batch work, with pricing based on active compute time and the option to scale to 50,000+ sandboxes.
Related postings
Gamma: Site Reliability Engineer, San Francisco, CA, USA
EngFlow: Site Reliability Engineer, San Francisco, CA, USA
Clay: Site Reliability Engineer, New York, NY, USA
TWO95 International, Inc: Site Reliability Engineer, Phoenix, AZ, USA