Blaxel

Site Reliability Engineer

Added 2 months ago

Description

Site Reliability Engineer role at Blaxel, focused on the reliability, performance, and scalability of its AI infrastructure platform. The role is responsible for observability, incident response, automation, and infrastructure security across the compute, networking, storage, and sandboxed execution layers. Emphasis is on building and operating the core infrastructure that powers the 25ms cold-start compute engine, on SLOs/SLIs and self-healing automation, and on collaboration with platform engineers.

Company

Blaxel offers a persistent sandbox platform that keeps sandboxes in standby to preserve memory and context, letting AI workloads resume instantly. Sandboxes auto-suspend when idle, incurring zero standby compute cost, and resume in about 25ms with full memory and a memory-backed filesystem. The system co-locates agents and data on a high-speed backbone to minimize latency, supports long-term data retention with block-storage volumes, and provides configurable networking (custom domains, dedicated IPs, VPC). The Blaxel SDK enables spawning thousands of sandboxes for batch work, with pricing based on active compute time and the option to scale to 50,000+ sandboxes.
