Warning
The Autoscaler is currently in Beta and may experience changes, quirks, and downtime.
The Vast.ai Autoscaler manages groups of instances to serve applications, scaling up or down based on load metrics defined via the Vast PyWorker. It automates instance management, performance measurement, and error handling.
You can create Autogroups using our autoscaler-compatible templates, which are modified versions of popular Vast templates.
The system architecture for an application using Vast.ai Autoscaling includes the following components:
To send a request to your application:

1. Call https://run.vast.ai/route/ with your endpoint, API key, and any optional parameters (e.g., cost).
2. Call {worker_address}/generate, passing the info returned by /route/ along with your request parameters (e.g., prompt).

This 2-step routing process is used for simplicity and flexibility. It means you don't need to route user data through a central server provided by Vast.ai, since the load balancer doesn't require those details to route your request.
The /route/ endpoint signs its messages, and the corresponding public key is available at https://run.vast.ai/pubkey/. This allows the GPU worker to verify that a request was authorized by the load balancer and to reject unauthorized usage.
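The worker-side check can be sketched as follows. This is a minimal illustration of public-key signature verification using the third-party `cryptography` package with Ed25519 keys; the key type, message format, and keypair generated here are assumptions for the sketch, not Vast's actual scheme (in production, /route/ holds the private key and the worker fetches the matching public key from https://run.vast.ai/pubkey/).

```python
# Sketch of how a GPU worker might validate a signed routing message.
# Assumes Ed25519 keys via the third-party `cryptography` package;
# Vast's actual key type and message encoding may differ.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey


def is_valid(public_key, message: bytes, signature: bytes) -> bool:
    # Accept the request only if the signature matches the message.
    try:
        public_key.verify(signature, message)
        return True
    except InvalidSignature:
        return False


# Stand-in for the signer: in reality /route/ signs with its private key,
# and workers only ever see the public half.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

msg = b'{"endpoint": "my-llm", "cost": 256}'
sig = private_key.sign(msg)

assert is_valid(public_key, msg, sig)               # genuine request passes
assert not is_valid(public_key, b"tampered", sig)   # forged request fails
```

A request whose payload does not match its signature is rejected, so only clients that went through /route/ can use the worker.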
In the future, we may add an optional proxy service to reduce this to a single step.
For a detailed walkthrough of LLM inference using HuggingFace TGI as the worker backend, refer to this guide.