
Architecture

Warning: The Autoscaler is currently in Beta, and is subject to changes, 'quirks', and downtime.

The Autoscaler manages groups of instances that serve an application, scaling them up or down according to load (as measured by the vast-pyworker metrics code). Together, the Autoscaler and the accompanying GPU worker code automate instance management (creation, deletion, stopping, starting, etc.), performance/capacity measurement, and error handling.

Endpoint Groups and Autogroups

There are two levels of organization in the Autoscaler: Endpoint Groups and Autogroups. An Endpoint Group is the higher-level category: it automatically manages a group of instances that serve an endpoint in response to user load. An Autogroup defines the configuration of the machines that run the code serving that endpoint. Autogroups are defined by templates, and the easiest way to create one is to use one of our autoscaler-compatible templates, which are modified versions of popular templates on Vast. A single Endpoint Group can contain multiple Autogroups, letting you try different machine configurations that serve the same endpoint; the Autoscaler automatically tracks which configuration performs better and takes this into account when managing instances.
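To make the relationship concrete, here is a purely illustrative Python sketch (the class and field names are our own, not the Autoscaler API): one Endpoint Group serving a single endpoint can contain several Autogroups, each describing a different machine configuration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Autogroup:
    # One machine configuration (e.g. a template) that can serve the endpoint.
    template_name: str
    gpu_type: str

@dataclass
class EndpointGroup:
    # Manages all instances serving one endpoint in response to user load.
    endpoint_name: str
    autogroups: List[Autogroup] = field(default_factory=list)

# One endpoint served by two candidate machine configurations; the Autoscaler
# tracks which configuration performs better and weights it accordingly.
llm_endpoint = EndpointGroup(
    endpoint_name="my-llm-endpoint",
    autogroups=[
        Autogroup(template_name="tgi-llama", gpu_type="RTX 4090"),
        Autogroup(template_name="tgi-llama", gpu_type="A100"),
    ],
)
```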

System Architecture

The general system architecture for a complete application using Vast.ai Autoscaling has a few main components:

  1. The Autoscaler (Vast.ai)
  2. The load balancer (Vast.ai)
  3. GPU worker code (we provide a framework and examples; you customize or replace as needed)
  4. The application website (yours)

An example workflow (for a consumer LLM app) works like this:

  1. A customer initiates a request through your website
  2. Your website makes a call to https://run.vast.ai/route/ with your endpoint name, API key, and any optional parameters (such as cost)
  3. The /route/ endpoint returns a suitable worker address
  4. Your website makes a call to your GPU worker's specific API endpoint, such as {worker_address}/generate (or /generate-stream/, etc.), passing the information returned by /route/ along with any request parameters (prompt, etc.)
  5. Your website then returns the results to the client browser (or does whatever else you want)

We currently use this two-step routing process for simplicity and flexibility. It has the additional benefit that you don't need to route your prompts (or other user data) through any central server provided by Vast.ai, as our load balancer doesn't need those details to route your request.
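As a rough sketch of the two-step flow from your web backend's point of view (the request and response field names such as `endpoint`, `api_key`, `cost`, and `url` are assumptions here, not the definitive API; see the TGI walkthrough referenced at the end of this page for the actual payloads):

```python
import requests

ROUTE_URL = "https://run.vast.ai/route/"

def call_worker(prompt: str, endpoint: str, api_key: str) -> dict:
    # Step 1: ask the load balancer for a suitable worker address.
    route_resp = requests.post(
        ROUTE_URL,
        json={"endpoint": endpoint, "api_key": api_key, "cost": 256},
        timeout=10,
    )
    route_resp.raise_for_status()
    route_info = route_resp.json()
    worker_address = route_info["url"]  # assumed field name

    # Step 2: call the GPU worker directly, forwarding the routing info it
    # needs to verify the request, plus your own request parameters.
    worker_resp = requests.post(
        f"{worker_address}/generate",
        json={"auth_data": route_info, "prompt": prompt},
        timeout=60,
    )
    worker_resp.raise_for_status()
    return worker_resp.json()
```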

The /route/ endpoint signs the messages it returns; the corresponding public key (available at https://run.vast.ai/pubkey/) lets the GPU worker verify this signature and validate requests (preventing others from using your workers).
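On the worker side, verification might look roughly like the sketch below. The exact message format and signature scheme (assumed here to be RSA with SHA-256 over a serialization of the routing fields) depend on the vast-pyworker implementation, so treat this only as an outline.

```python
import json

import requests
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

# Fetch the load balancer's public key once at startup.
PUBKEY_PEM = requests.get("https://run.vast.ai/pubkey/", timeout=10).text
PUBLIC_KEY = serialization.load_pem_public_key(PUBKEY_PEM.encode())

def request_is_valid(signed_fields: dict, signature: bytes) -> bool:
    # Assumed scheme: /route/ signs a serialization of the routing fields;
    # the worker rejects any request whose signature does not verify.
    message = json.dumps(signed_fields, sort_keys=True).encode()
    try:
        PUBLIC_KEY.verify(
            signature,
            message,
            padding.PKCS1v15(),
            hashes.SHA256(),
        )
        return True
    except InvalidSignature:
        return False
```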

We will likely add an optional proxy service in the future to reduce this two-step process to a single step, but for now it remains two steps. (Since you probably need to proxy customer traffic anyway, the two-step method may reduce latency.)

A full example walkthrough of LLM inference using HuggingFace TGI as the worker backend is available in this guide.