Docs - Autoscaler

Architecture

Warning
The Autoscaler is currently in Beta and may experience changes, quirks, and downtime.

The Vast.ai Autoscaler manages groups of instances to serve applications, scaling up or down based on load metrics defined via the Vast PyWorker. It automates instance management, performance measurement, and error handling.

Endpoint Groups and Autogroups #

  • Endpoint Groups: Automatically manage a group of instances to serve an endpoint in response to user load.
  • Autogroups: Define the configuration of machines that run the code serving the endpoint. Multiple Autogroups can exist within an Endpoint Group so you can test different configurations; the Autoscaler tracks the performance of each and optimizes instance management accordingly.

You can create Autogroups using our autoscaler-compatible templates, which are modified versions of popular templates on Vast.
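
Conceptually, an Endpoint Group collects one or more Autogroups, each pairing a template with the hardware it should run on. The structure below is a hypothetical illustration only; the field names are placeholders, not actual Vast.ai parameters.

```python
# Hypothetical illustration of how an Endpoint Group relates to its Autogroups.
# Field names here are placeholders, not the actual Vast.ai API parameters.
endpoint_group = {
    "name": "llm-demo",  # the endpoint your application routes requests to
    "autogroups": [
        {   # one candidate configuration the Autoscaler can create workers from
            "template": "tgi-llama-autoscaler",      # an autoscaler-compatible template
            "gpu_search": "gpu_ram>=24 num_gpus=1",  # hardware to rent for this group
        },
        {   # a second configuration whose performance can be compared against the first
            "template": "tgi-llama-autoscaler",
            "gpu_search": "gpu_ram>=48 num_gpus=1",
        },
    ],
}
```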

System Architecture #

The system architecture for an application using Vast.ai Autoscaling includes the following components:

  1. Autoscaler (Vast.ai)
  2. Load Balancer (Vast.ai)
  3. GPU Worker Code (Customize using our PyWorker framework and examples)
  4. Application Website (Your responsibility)

Example Workflow for a Consumer LLM App #

  1. A customer initiates a request through your website.
  2. Your website calls https://run.vast.ai/route/ with your endpoint, API key, and any optional parameters (e.g., cost).
  3. The /route/ endpoint returns a suitable worker address.
  4. Your website calls the GPU worker's specific API endpoint, like {worker_address}/generate, passing the info returned by /route/ along with request parameters (e.g., prompt).
  5. Your website returns the results to the client's browser or handles them as needed.
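
The sketch below shows what this flow can look like from a Python backend. The request and response field names (endpoint, api_key, cost, url) and the shape of the worker's /generate payload are illustrative assumptions; the exact schema depends on your PyWorker implementation.

```python
# A minimal sketch of the two-step flow from your application backend.
# Field names ("endpoint", "api_key", "cost", "url") and the /generate payload
# are illustrative assumptions -- check the PyWorker examples for the exact
# schema your worker expects.
import requests

ROUTE_URL = "https://run.vast.ai/route/"

def generate(prompt: str, endpoint_name: str, api_key: str) -> dict:
    # Step 1: ask the Vast.ai load balancer for a suitable worker.
    route_resp = requests.post(
        ROUTE_URL,
        json={
            "endpoint": endpoint_name,  # your Endpoint Group name
            "api_key": api_key,         # your Vast.ai API key
            "cost": 256,                # optional load estimate for this request
        },
        timeout=10,
    )
    route_resp.raise_for_status()
    route_info = route_resp.json()  # worker address plus the signed routing payload

    # Step 2: call the worker directly, forwarding what /route/ returned
    # so the worker can verify the request came through the load balancer.
    worker_resp = requests.post(
        f"{route_info['url']}/generate",
        json={**route_info, "prompt": prompt},
        timeout=60,
    )
    worker_resp.raise_for_status()
    return worker_resp.json()
```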

Two-Step Routing Process #

This two-step routing process keeps things simple and flexible: you don't need to route all user data through a central Vast.ai server, because our load balancer doesn't need those details to route your request.

The /route/ endpoint signs its messages with a public key available at https://run.vast.ai/pubkey/, allowing the GPU worker to validate requests and prevent unauthorized usage.
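
As an illustration, a worker could validate an incoming request roughly as sketched below. The payload field names and the RSA / SHA-256 scheme are assumptions made for the example; in practice the PyWorker framework handles this validation for you.

```python
# A minimal sketch of worker-side request validation, assuming the client forwards
# the signed payload from /route/ as "message" plus a base64 "signature", and that
# the key served at /pubkey/ is a PEM-encoded RSA key used with PKCS#1 v1.5 / SHA-256.
# These details are assumptions; the real scheme is implemented by the PyWorker framework.
import base64
import requests
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

PUBKEY_URL = "https://run.vast.ai/pubkey/"

# Fetch and cache the load balancer's public key once at worker startup.
_public_key = serialization.load_pem_public_key(
    requests.get(PUBKEY_URL, timeout=10).content
)

def is_authorized(message: bytes, signature_b64: str) -> bool:
    """Return True if the payload was signed by the Vast.ai /route/ endpoint."""
    try:
        _public_key.verify(
            base64.b64decode(signature_b64),
            message,
            padding.PKCS1v15(),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```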

In the future, we may add an optional proxy service to reduce this to a single step.

For a detailed walkthrough of LLM inference using HuggingFace TGI as the worker backend, refer to this guide.