Description
The role
This role is for Nebius AI R&D, a team focused on applied research and the development of AI-heavy products. Examples of applied research that we have recently published include:
- investigating how test-time guided search can be used to build more powerful agents;
- dramatically scaling task data collection to power reinforcement learning for SWE agents;
- maximizing efficiency of LLM training on agentic trajectories.
One example of an AI product that we are deeply involved in is Nebius Token Factory — an inference and fine-tuning platform for AI models.
This role requires expertise in distributed systems to build a large-scale LLM training platform.
Your responsibilities will include:
- Designing and developing the LLM training platform.
- Maintaining our ML infrastructure, ensuring optimal performance, scalability and reliability.
- Improving job scheduling strategies to minimize resource fragmentation.
We expect you to have:
- 5+ years of professional software development experience.
- Strong software engineering skills (we mostly use Python).
- Proficiency in contemporary software engineering approaches, including CI/CD, version control and unit testing.
- Experience with developing web services.
- A commitment to maintaining extreme rigor in all job-related activities.
Nice to have:
- Previous experience working with language models or other similar NLP technologies.
- A track record of building and delivering products (not necessarily ML-related) in a dynamic startup-like environment.
- Strong engineering skills, including experience in developing large distributed systems or high-load web services.
- Open-source projects that showcase your engineering prowess.
Company
Nebius provides an AI-focused cloud platform for scalable GPU clusters (from a single GPU to thousands of NVIDIA GPUs) with pre-configured drivers, InfiniBand networking, and orchestrators such as Kubernetes or Slurm. It offers fully managed services (MLflow, PostgreSQL, Apache Spark), cloud-native tooling (Terraform, API, CLI), ready-to-go solutions, and expert support. Nebius also runs data centers, takes part in AI research collaborations and the open-source AI ecosystem (for example, vLLM and CRISPR-GPT), and partners with NVIDIA as a Reference Platform Cloud Partner.