Warning: The Autoscaler is currently in Beta, and is subject to changes, 'quirks', and downtime.
The Autoscaler manages groups of instances, called autogroups, that serve an application and scale up or down according to load (as measured by the vast-pyworker metrics code). The autoscaler and the accompanying GPU worker code automate instance management (creation, deletion, stopping, starting, etc.), performance/capacity measurement, and error handling.
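To make the scaling idea concrete, here is a minimal, purely illustrative sketch of a load-based scaling decision. The metric names, headroom factor, and rounding are assumptions for illustration only; the real decisions are driven by the vast-pyworker metrics code, not by this toy function.

```python
# Illustrative only: a toy load-based scaling decision, NOT the actual
# autoscaler implementation. Metric names and the headroom factor are assumptions.

def desired_worker_count(requests_per_sec: float,
                         capacity_per_worker: float,
                         headroom: float = 1.25) -> int:
    """Return how many workers are needed to serve the measured load,
    keeping some headroom so bursts don't overload a single worker."""
    if capacity_per_worker <= 0:
        return 1
    needed = requests_per_sec * headroom / capacity_per_worker
    # Never scale below one worker while the endpoint is live.
    return max(1, round(needed))

# Example: 40 req/s measured, each worker handles ~12 req/s -> 4 workers.
print(desired_worker_count(requests_per_sec=40.0, capacity_per_worker=12.0))
```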
The general system architecture for a complete application using Vast.ai Autoscaling has a few main components:
An example workflow (for a consumer LLM app) works like this:
We currently use this 2-step routing process for simplicity and flexibility. It also has the benefit that you don't need to route your prompts (or other user data) through any central server provided by vast.ai, as our load balancer doesn't need those details to route your request.
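For illustration, here is a hedged Python sketch of the 2-step flow: step 1 asks the load balancer's /route/ endpoint which GPU worker should serve the call, and step 2 sends the prompt directly to that worker. The request/response field names (endpoint, api_key, cost, url, auth_data) and the worker's /generate path are assumptions, not the exact API contract; see the guide linked at the end of this section for the real payloads.

```python
# Sketch of the 2-step routing flow. Field names and the worker's
# /generate path are assumptions, not the exact API contract.
import requests

ROUTE_URL = "https://run.vast.ai/route/"

def route_and_call(endpoint_name: str, api_key: str, prompt: str) -> dict:
    # Step 1: ask the load balancer which GPU worker should serve this call.
    route = requests.post(ROUTE_URL, json={
        "endpoint": endpoint_name,      # assumed field name
        "api_key": api_key,             # assumed field name
        "cost": 256,                    # assumed: estimated workload size
    })
    route.raise_for_status()
    worker = route.json()               # assumed: worker address + signed payload

    # Step 2: call the GPU worker directly, forwarding the signed payload so
    # the worker can verify the request came through the load balancer.
    result = requests.post(worker["url"] + "/generate", json={
        "auth_data": worker,            # assumed: signed routing payload
        "payload": {"inputs": prompt},  # assumed: backend-specific body
    })
    result.raise_for_status()
    return result.json()
```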
The /route/ endpoint signs the messages it returns, and the corresponding public key (available at https://run.vast.ai/pubkey/) lets the GPU worker verify that signature and validate incoming requests (preventing others from using your workers).
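As a rough sketch of the worker-side check, the snippet below fetches the public key and verifies a signature over the routing message. It assumes the key is served as PEM and that the signature is RSA with SHA-256 over a canonical JSON serialization; the actual algorithm and serialization are defined by the vast-pyworker code, so treat these details as assumptions.

```python
# Sketch of worker-side signature verification. The signing algorithm
# (RSA + SHA-256, PKCS#1 v1.5) and the message serialization are assumptions;
# the vast-pyworker code defines the real scheme.
import base64
import json

import requests
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

PUBKEY_URL = "https://run.vast.ai/pubkey/"

def load_pubkey():
    pem = requests.get(PUBKEY_URL).text.encode()
    return serialization.load_pem_public_key(pem)

def is_valid_request(message: dict, signature_b64: str, pubkey) -> bool:
    """Accept the request only if the signature matches the routing message."""
    signed_bytes = json.dumps(message, sort_keys=True).encode()  # assumed serialization
    try:
        pubkey.verify(
            base64.b64decode(signature_b64),
            signed_bytes,
            padding.PKCS1v15(),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```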
We will likely add an optional proxy service in the future to reduce this 2-step process to a single step, but for now it remains a 2-step process. (Since you probably need to proxy customer traffic anyway, the 2-step method may reduce latency.)
A full example walkthrough of LLM inference using HuggingFace TGI as the worker backend is available in this guide.