The Vast.ai Autoscaler manages groups of instances to efficiently serve applications, automatically scaling up or down based on load metrics defined by the Vast PyWorker. It streamlines instance management, performance measurement, and error handling.
The Autoscaler is configured at two levels: Endpoint Groups and the Autoscaling Groups (Autogroups) within them.
This two-level setup provides several benefits; the examples below illustrate them, followed by a brief configuration sketch:
Comparing Performance Metrics Across Hardware: Suppose you want to run the same templates on different hardware to compare performance metrics. You can create several autoscaling groups, each configured to run on specific hardware. By leaving this setup running for a period of time, you can review the metrics and select the most suitable hardware for your users' needs.
Smooth Rollout of a New Model: If you're using TGI to handle LLM inference with Llama 2 and want to transition to Llama 3, you can do so gradually. For a smooth rollout where only 10% of user requests are handled by Llama 3, you can create a new autoscaling group under the existing Endpoint Group. Let it run for a while, review the metrics, and then fully switch to Llama 3 when ready.
Handling Diverse Workloads with Multiple Models: You can create an Endpoint Group to manage LLM inference using TGI. Within this group, you can set up multiple Autogroups, each using a different LLM to serve requests. This approach is beneficial when you need a few resource-intensive models to handle most requests, while smaller, more cost-effective models manage overflow during workload spikes.
It's important to note that having multiple Autogroups within a single Endpoint Group is not always necessary. For most users, a single Autogroup within an Endpoint Group provides an optimal setup.
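To make the two-level structure concrete, the sketch below describes one Endpoint Group containing two Autogroups, mirroring the multi-model example above. It is purely illustrative: the field names (name, autogroups, template, gpu_filter, traffic_share) are invented for this sketch and are not the actual Vast.ai API or CLI parameters.

```python
# Purely illustrative sketch of the two configuration levels:
# one Endpoint Group (the public endpoint clients call) containing
# multiple Autogroups (each pinning a template/model to hardware).
# Field names here are invented and do NOT match the Vast.ai API or CLI.
endpoint_group = {
    "name": "llm-inference",  # the Endpoint Group that clients route to
    "autogroups": [
        {
            # Primary capacity: a large model on high-end GPUs.
            "template": "tgi-llama-3-70b",   # hypothetical template name
            "gpu_filter": "A100 or H100",
            "traffic_share": 0.9,
        },
        {
            # Overflow capacity: a smaller, cheaper model for load spikes.
            "template": "tgi-llama-3-8b",    # hypothetical template name
            "gpu_filter": "RTX 4090",
            "traffic_share": 0.1,
        },
    ],
}
```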
You can create Autogroups using our autoscaler-compatible templates, which are customized versions of popular templates on Vast.
The system architecture for an application using Vast.ai Autoscaling handles each client request in two steps:
1. The client calls https://run.vast.ai/route/ with your endpoint, API key, and any optional parameters (e.g., cost).
2. The client then calls {worker_address}/generate, passing the info returned by /route/ along with the request parameters (e.g., prompt).
This two-step routing process is used for simplicity and flexibility. It ensures that you don't need to route all user data through a central server provided by Vast.ai, as our load balancer doesn't require those details to route your request.
The /route/ endpoint signs its messages; the corresponding public key, available at https://run.vast.ai/pubkey/, allows the GPU worker to validate requests and prevent unauthorized usage.
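For illustration, a worker-side check could look like the sketch below: it fetches the public key from https://run.vast.ai/pubkey/ and verifies a signature over the raw message bytes. The signature scheme (RSA with PKCS#1 v1.5 and SHA-256) and the base64 signature encoding are assumptions; the PyWorker ships the actual verification logic.

```python
# Illustrative sketch of verifying a signed routing message on the worker.
# The RSA/PKCS#1 v1.5 + SHA-256 scheme and base64 encoding are assumptions;
# the real verification is implemented inside the PyWorker.
import base64

import requests
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

PUBKEY_URL = "https://run.vast.ai/pubkey/"

def fetch_public_key():
    # Download the PEM-encoded public key published by the load balancer.
    pem = requests.get(PUBKEY_URL, timeout=10).text
    return serialization.load_pem_public_key(pem.encode())

def is_request_authorized(message: bytes, signature_b64: str) -> bool:
    # Verify that `message` was signed by the /route/ endpoint's private key.
    public_key = fetch_public_key()
    try:
        public_key.verify(
            base64.b64decode(signature_b64),
            message,
            padding.PKCS1v15(),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```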
In the future, we may add an optional proxy service to reduce this routing flow to a single step.
For a detailed walkthrough of LLM inference using HuggingFace TGI as the worker backend, refer to this guide.