Docs - Autoscaler


Warning: The Autoscaler is currently in Beta, and is subject to changes, 'quirks', and downtime.

The Autoscaler manages groups of instances that serve an application, scaling them up or down according to load (as measured by the vast-pyworker metrics code). The Autoscaler and the accompanying GPU worker code automate instance management (creation, deletion, stopping, starting, etc.), performance/capacity measurement, and error handling.

Endpoint Groups and Autogroups #

There are two levels of classification for using the autoscaler: Endpoint Groups and Autogroups. An Endpoint Group is the higher-level category: it automatically manages a group of instances to serve an endpoint in response to user load. Autogroups define the configuration of the machines that will run the code serving this endpoint. Autogroups are defined by templates and can easily be created using one of our autoscaler-compatible templates, which are modified versions of popular templates on Vast. You can have multiple Autogroups in a single Endpoint Group to try different machine configurations that serve the same endpoint; the autoscaler automatically tracks which machine config performs better and takes this into account when managing instances.
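To illustrate the two-level structure, here is a minimal sketch in Python. The field names (`endpoint_name`, `template`, `gpu_type`) and template names are hypothetical stand-ins, not the actual autoscaler API:

```python
# Hypothetical sketch of the two-level structure: one Endpoint Group
# serving an endpoint, with two Autogroups trying different machine
# configurations. All field names here are illustrative, not the real API.

endpoint_group = {
    "endpoint_name": "my-llm-endpoint",  # the endpoint your app calls
    "autogroups": [
        {
            # Autogroup 1: one machine configuration, e.g. created from
            # an autoscaler-compatible template on Vast
            "template": "tgi-llama-a100",  # hypothetical template name
            "gpu_type": "A100",
        },
        {
            # Autogroup 2: an alternative config serving the same endpoint;
            # the autoscaler tracks which one performs better
            "template": "tgi-llama-4090",  # hypothetical template name
            "gpu_type": "RTX 4090",
        },
    ],
}
```

Both Autogroups serve the same endpoint; the autoscaler compares their measured performance when deciding which configuration to favor as it creates and destroys instances.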

System Architecture #

The general system architecture for a complete application using Autoscaling has a few main components:

  1. the autoscaler (hosted by Vast)
  2. the load balancer (hosted by Vast)
  3. GPU worker code (we provide a framework and examples; you customize/replace as needed)
  4. application website (yours)

An example workflow (for a consumer LLM app) works like this:

  1. A customer initiates a request through your website
  2. Your website makes a call to the /route/ endpoint with your endpoint name, API key, and any optional params (cost)
  3. The /route/ endpoint returns a suitable worker address
  4. Your website makes a call to your GPU worker's specific API endpoint, like {worker_address}/generate (or /generate-stream, etc.), passing the info returned by /route/ along with any request params (prompt, etc.)
  5. Your website then returns the results to the client browser (or does whatever else you want)
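The server-side portion of the steps above (steps 2-4) can be sketched in Python. This is a hedged illustration, not the actual client library: the route URL, the payload field names (`endpoint`, `api_key`, `cost`), and the response field name (`url`) are assumptions that may differ from the real API.

```python
import json
import urllib.request

# Assumption: the actual /route/ URL may differ.
ROUTE_URL = "https://run.vast.ai/route/"


def build_route_payload(endpoint_name, api_key, cost=None):
    """Step 2: the payload sent to /route/ to request a worker address.
    Field names are hypothetical."""
    payload = {"endpoint": endpoint_name, "api_key": api_key}
    if cost is not None:
        payload["cost"] = cost  # optional hint about the request's cost
    return payload


def call_json(url, payload):
    """POST a JSON body and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def handle_request(endpoint_name, api_key, prompt):
    # Step 3: /route/ returns a suitable worker address plus signed
    # routing info that the worker will validate.
    route_info = call_json(ROUTE_URL, build_route_payload(endpoint_name, api_key))
    worker_address = route_info["url"]  # assumption: field name may differ
    # Step 4: call the worker's own API, forwarding the signed routing
    # info along with the request params.
    body = dict(route_info, prompt=prompt)
    return call_json(worker_address + "/generate", body)
```

Your website would call something like `handle_request(...)` per customer request and return the result to the browser (step 5).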

We currently use this 2-step routing process for simplicity and flexibility. It has the additional benefit that you don't need to route your prompts (or other user data) through any central server provided by Vast, as our load balancer doesn't need those details to route your request.

The /route/ endpoint signs the messages it returns with its private key; the GPU worker can then check this signature against the published public key to validate requests (preventing others from using your workers).
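The validation pattern on the worker side can be sketched as follows. Note the real system uses public-key signatures (the /route/ server signs with a private key; the worker verifies with the published public key); to stay self-contained with only the standard library, this sketch substitutes an HMAC shared secret as a stand-in, and the message contents are hypothetical.

```python
import hashlib
import hmac

# Stand-in for the real key material. The actual autoscaler uses
# public-key signatures; this sketch uses an HMAC shared secret so it
# runs with only the standard library.
KEY = b"demo-shared-secret"


def sign(message: bytes, key: bytes = KEY) -> str:
    """What the /route/ server conceptually does to the routing info
    before returning it to your website."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def worker_validates(message: bytes, signature: str, key: bytes = KEY) -> bool:
    """The GPU worker verifies the signature before serving a request,
    rejecting callers that did not go through the /route/ endpoint."""
    expected = sign(message, key)
    return hmac.compare_digest(expected, signature)


# Hypothetical routing info signed by /route/ and forwarded to the worker.
routing_info = b'{"endpoint": "my-llm-endpoint", "worker": "1.2.3.4:5000"}'
sig = sign(routing_info)
```

A legitimate request (unmodified `routing_info` plus `sig`) passes `worker_validates`, while a tampered payload or a missing signature is rejected.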

We will likely add an optional proxy service in the future to reduce this 2-step process to a single step, but for now it's a 2-step process. (Since you probably need to proxy customer traffic anyway, the 2-step method may reduce latency.)

A full example walkthrough of LLM inference using HuggingFace TGI for the worker backend is available in this guide.