Nebius

Senior Software Engineer (Token Factory)

Added 2 months ago

Description

The role

This role is for Nebius AI R&D, a team focused on applied research and the development of AI-heavy products. Examples of applied research that we have recently published include:

  • investigating how test-time guided search can be used to build more powerful agents;
  • dramatically scaling task data collection to power reinforcement learning for SWE agents;
  • maximizing efficiency of LLM training on agentic trajectories.

One example of an AI product that we are deeply involved in is Nebius Token Factory — an inference and fine-tuning platform for AI models.

This role requires expertise in distributed systems to build a large-scale LLM training platform.

Your responsibilities will include: 

  • Designing and developing an LLM training platform.
  • Maintaining our ML infrastructure, ensuring optimal performance, scalability and reliability.
  • Improving job scheduling strategies to minimize resource fragmentation. 

We expect you to have:

  • 5+ years of professional software development experience.
  • Strong software engineering skills (we mostly use Python).
  • Proficiency in contemporary software engineering approaches, including CI/CD, version control and unit testing.
  • Experience with developing web services.
  • A commitment to maintaining extreme rigor in all job-related activities.

Nice to have:

  • Previous experience working with language models or other similar NLP technologies.
  • A track record of building and delivering products (not necessarily ML-related) in a dynamic startup-like environment.
  • Strong engineering skills, including experience in developing large distributed systems or high-load web services.
  • Open-source projects that showcase your engineering prowess.

Company

Nebius provides an AI-focused cloud platform with scalable GPU clusters (from a single GPU to thousands of NVIDIA GPUs), pre-configured drivers, InfiniBand networking, and orchestrators such as Kubernetes or Slurm. It offers fully managed services (MLflow, PostgreSQL, Apache Spark), cloud-native tooling (Terraform, API, CLI), ready-to-go solutions, and expert support. Nebius also runs data centers, takes part in AI research collaborations and the open-source AI ecosystem (for example, vLLM and CRISPR-GPT), and partners with NVIDIA as a Reference Platform Cloud Partner.
