Docs - Autoscaler

Getting Started

Some popular templates on Vast, such as text-generation-inference (TGI) and ComfyUI, can be run in API mode to act as inference servers backing an application.

Vast's autoscaling service automates instance management, performance tracking, and error handling. The autoscaler also provides authentication services to ensure requests coming to your Vast instances are only coming from approved clients.

Note: This guide assumes knowledge of the Vast CLI, and an introduction for it can be found here.

We highly recommend reading about the autoscaler architecture here before you start.

For this example, we will set up an endpoint group that uses TGI and the Llama 3 model to serve inference requests.

1) Create your Endpoint Group #

To use an API endpoint, you need to create an "Endpoint Group" (aka endptgroup) that manages your endpoint in response to incoming load.

You can do this through the GUI or the CLI. Here, we'll create an endpoint group named "TGI-Llama3":

vastai create endpoint --endpoint_name "TGI-Llama3" --cold_mult 1.0 --min_load 100 --target_util 0.9 --max_workers 20 --cold_workers 5
  • "min_load" : This is the baseline amount of load (tokens / second for LLMs) you want your autoscaling group to be able to handle.

    • For LLMs, a good default is 100.0
    • For Text2Image, a good default is 200.0
  • "target_util" : The percentage of your autogroup compute resources that you want to be in-use at any given time. A lower value allows for more slack, which means your instance group will be less likely to be overwhelmed if there is a sudden spike in usage.

    • For LLMs, a good default is 0.9
    • For Text2Image, a good default is 0.4
      • ComfyUI, the backend used for Text2Image, does not support parallel requests, so requests are queued and handled one at a time. This means your instances can quickly build up a long queue and become overwhelmed. If you want to ensure your users never experience long queue times, you should leave a lot of slack.
  • "cold_mult" : The multiple of your current load that is used to predict your future load, for example if you currently have 10 users, but expect there to be 20 in the near future, you can set cold_mult = 2.0. This should be set to 2.0 to begin for both LLMs and Text2Image.

  • "max_workers" : The maximum number of workers your endpoint group can have.

  • "cold_workers": The minimum number of workers you want to keep "cold" (meaning stopped and fully loaded) when your group has no load. Note that this is only taken into account if you already have workers which are fully loaded but are no longer needed. A good way to ensure that you have enough workers that are loaded is setting the "test_workers" parameter of the autogroup correctly.

2) Prepare the template #

Templates encapsulate all the information required to run an application with the autoscaler, including machine parameters, docker image, and environment variables.

For some of our popular templates, we have created autoscaler-compatible versions that allow you to serve specific models in API mode on hardware best suited to each model. The templates that come pre-configured to work with the autoscaler can be found on our Autoscaler Templates page. You can create an autogroup using one of these templates by specifying the template_hash on autogroup creation.

Note: The public pre-configured templates should not be used directly, because they do not have the HF_TOKEN variable set. You must create a private copy of those templates with HF_TOKEN set to your Huggingface API token and use the template hash of that private copy instead. The Huggingface API token is needed to download gated models.

Go to the Autoscaler TGI template, create a new private copy of it, set HF_TOKEN to your Huggingface API token, and set MODEL_ID to "meta-llama/Meta-Llama-3-8B-Instruct".

Llama3 is a gated model, so be sure to go to the Huggingface model page for meta-llama/Meta-Llama-3-8B-Instruct while logged into your Huggingface account and accept the terms and conditions of the model.

Your Huggingface API token should be a READ type token. A Fine-grained token works as well, as long as it has the "Read access to contents of all public gated repos you can access" permission.
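
Before wiring the token into your template, you can sanity-check that it actually has access to the gated model. A minimal sketch using the huggingface_hub package (an optional assumption here; it is not required by the template itself):

# Requires `pip install huggingface_hub`; raises an error if the token lacks access.
from huggingface_hub import model_info

HF_TOKEN = "hf_..."  # your READ or fine-grained Huggingface token
info = model_info("meta-llama/Meta-Llama-3-8B-Instruct", token=HF_TOKEN)
print(info.id, "is accessible; gated =", info.gated)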

3) Create an Autoscaling Group #

Endpoint groups consist of one or more "Autoscaling Groups" (aka autogroups). Autogroups describe the machine configurations and parameters that will serve the requests, and they can be fully defined by a template. Use the template_hash of the template created in the previous step, and the endpoint_name from step 1:

vastai create autoscaler --endpoint_name "TGI-Llama3" --template_hash "$TEMPLATE_HASH" --test_workers 5
  • "test_workers" : Min number of workers to create while initializing autogroup. This allows the autogroup to get performance estimates from machines running your configurations before deploying them to serve your endpoint. This will also allow you to create workers which are fully loaded and "stopped" (aka "cold") so that they can be started quickly when you introduce load to your endpoint.

Note that if you don't create an endpoint group explicitly before creating your autogroup, an endpoint group with the given name will be created in your account automatically; you will just need to make sure that its parameters (--cold_mult, --min_load, --target_util) are set correctly.

Once you have an autogroup to define your machine configuration and an endpoint group to define a managed endpoint for your API, Vast's autoscaling server will go to work automatically finding offers and creating instances from them for your api endpoint. The instances the autoscaler creates will be accessible from your account and will have a tag corresponding to the name of your endpoint group.
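
To confirm that workers are being created, you can list your instances and look for the endpoint group's tag. A small sketch that shells out to the Vast CLI from Python, assuming the CLI's --raw flag for JSON output and that the tag appears in each instance's "label" field (both are assumptions; adjust to what your CLI version prints):

import json
import subprocess

# List instances as JSON and keep the ones tagged for our endpoint group.
raw = subprocess.run(
    ["vastai", "show", "instances", "--raw"],
    capture_output=True, text=True, check=True,
).stdout
for inst in json.loads(raw):
    if "TGI-Llama3" in (inst.get("label") or ""):
        print(inst["id"], inst.get("actual_status"))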

4) Send a request to your Endpoint Group #

It might take a few minutes for your first instances to be created and for the model to be downloaded onto them. For instances with low bandwidth, it can take up to 15 minutes to download a large model such as Flux. Once an instance has fully loaded the model, you can call the /route/ endpoint to obtain the address of your API endpoint on one of your worker servers. If no workers are ready, the route endpoint will indicate the number of loading workers in the "status" field of the returned JSON.
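
For example, you can poll /route/ and check whether a worker is ready before sending a real request. A minimal sketch, using the same payload as the full client example later in this guide:

import requests

# Ask the autoscaler for a worker address and report readiness.
resp = requests.post(
    "https://run.vast.ai/route/",
    json={"api_key": "YOUR_VAST_API_KEY", "endpoint": "TGI-Llama3", "cost": 256},
)
route = resp.json()
if route.get("url"):
    print("Worker ready at:", route["url"])
else:
    # While workers are still loading, the number of loading workers
    # is reported in the "status" field of the returned JSON.
    print("No worker ready yet:", route.get("status"))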

You can see what metrics are being sent to the Autoscaler in the instance logs. Your instance is loaded and benchmarked when the cur_perf value is non-zero.

Install the TLS certificate #

All of Vast.ai's autoscaler templates use SSL by default. If you want to disable it, you can add -e USE_SSL=false to the Docker options in your copy of the template. The Autoscaler will automatically adjust the instance URL to enable or disable SSL as needed.

  1. Download Vast AI's certificate from here.
  2. In the Python environment where you're running the client script, execute the following command:
    python3 -m certifi
    
    If you encounter a "No module named certifi" error, install the module by running:
    pip install certifi
    
  3. The command in step 2 will print the path to a file where certificates are stored. Append Vast AI's certificate to that file using the following command:
    cat jvastai_root.cer >> PATH/TO/CERT/STORE
    
    • You may need to run the above command with sudo if you are not running Python in a virtual environment.
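
If you prefer to script steps 2 and 3, here is a minimal sketch that locates the certifi store and appends the certificate. It assumes jvastai_root.cer is in the current directory and that you have write access to the store (otherwise rerun with elevated permissions):

import certifi

# Same path printed by `python3 -m certifi`.
store_path = certifi.where()
with open("jvastai_root.cer", "rb") as cert_file:
    cert = cert_file.read()
# Append Vast AI's certificate to the trusted bundle.
with open(store_path, "ab") as store:
    store.write(b"\n" + cert)
print(f"Appended Vast certificate to {store_path}")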

Note: This process only adds Vast AI's TLS certificate as a trusted certificate for Python clients. If you need to add the certificate system-wide on Windows or macOS, follow the steps outlined here.

For non-Python clients, you'll need to add the certificate to the trusted certificates for that specific client. If you encounter any issues, feel free to contact us on support chat for assistance.

Client Code #

Here is an example of calling the https://run.vast.ai/route/ endpoint and then forwarding a model request to the returned worker address. You can find TGI's endpoints and payload format here.

import requests
from typing import Dict, Any
from urllib.parse import urljoin


def get_auth_data(api_key: str, endpoint_name: str, cost: int) -> Dict[str, Any]:
    """
    auth_data sent back is in this format:
    {
        signature: str
        cost: str
        endpoint: str
        reqnum: int
        url: str
    }
    `url` is the IP address and port of the instance. The rest of the data is used for authentication.
    """
    response = requests.post(
        "https://run.vast.ai/route/",
        json={
            "api_key": api_key,
            "endpoint": endpoint_name,
            "cost": cost,
        },
    )
    return response.json()


def get_endpoint_group_response(
    api_key: str, endpoint_name: str, cost: int, inputs: str, parameters: Dict[str, Any]
):
    auth_data = get_auth_data(api_key, endpoint_name, cost)
    # Payload format should follow the format referenced in "Templates Reference" page in autoscaler docs.
    # In this example, payload is formatted for TGI
    payload = {"inputs": inputs, "parameters": parameters}
    # this is the format of requests for all implementations of PyWorker. auth_data is always the same data
    # returned by autoscaler's `/route` endpoint.
    pyworker_payload = {"auth_data": auth_data, "payload": payload}
    # Use the returned URL + your expected endpoint
    # For TGI, `/generate` endpoint is the PyWorker endpoint for generating an LLM response
    url = urljoin(auth_data["url"], "/generate")
    response = requests.post(url, json=pyworker_payload)
    return response.text


# Example Usage

# this should be your Vast api key
API_KEY = "YOUR_VAST_API_KEY"
# endpoint_name from step 1
ENDPOINT_NAME = "TGI-Llama3"
# cost is estimated number of tokens for request.
# For TGI, a good default is max_new_tokens
# For Comfy UI, the calculation is more complex, but a good default is 200
cost = 256
# You will also need to provide a payload object with your endpoint's expected query parameters
# In this example, we are using an expected payload for our TGI example in our docs
inputs = "What is the best movie of all time?"
parameters = {"max_new_tokens": cost}

reply_from_endpoint = get_endpoint_group_response(
    api_key=API_KEY,
    endpoint_name=ENDPOINT_NAME,
    cost=cost,
    inputs=inputs,
    parameters=parameters,
)

print("Request sent to endpoint: ", inputs)
print("Response from endpoint: ", reply_from_endpoint)
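
The example above returns the full generation in a single response. TGI also supports token streaming; here is a minimal sketch of a streaming variant, continuing from the example above and assuming the TGI PyWorker exposes a /generate_stream route mirroring TGI's own streaming endpoint (an assumption; check the PyWorker and TGI references for the exact route and event format):

from urllib.parse import urljoin
import requests

# Reuses get_auth_data, API_KEY, ENDPOINT_NAME, cost, inputs, parameters from above.
auth_data = get_auth_data(API_KEY, ENDPOINT_NAME, cost)
stream_url = urljoin(auth_data["url"], "/generate_stream")
with requests.post(
    stream_url,
    json={"auth_data": auth_data, "payload": {"inputs": inputs, "parameters": parameters}},
    stream=True,
) as response:
    for line in response.iter_lines():
        if line:
            print(line.decode())  # each non-empty line is one streamed event chunk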

For a full working client for all backends, see PyWorker. The client script for each backend can be found in workers/$BACKEND/client.py. You can use these scripts to test your endpoint groups.

Since we have created a TGI template, we'll use the TGI client to test our endpoint. Install the requirements with pip install -r requirements.txt, and run the TGI client:

python3 -m workers.tgi.client -k "$API_KEY" -e "TGI-Llama3"

You should get two responses printed out: the first is a synchronous, full response, and the second is a streaming response, where the model output is printed one token at a time.

5) Monitor your Groups #

The autoscaler server provides an endpoint for accessing the logs corresponding to your endpoint group and autogroups, which is described here.

There is also an endpoint that allows you to see metrics for your groups, which is described here.

6) Load testing #

There is a script for each backend to load test your instances. The -n flag indicates the total number of requests to send, and the -rps flag indicates the request rate (requests per second). The script will print out statistics on how many requests are being handled per minute. Install the required Python packages as in step 4, then run the following command:

python3 -m workers.tgi.test_load -n 1000 -rps 1 -k "$API_KEY" -e "TGI-Llama3"