

Please note the Autoscaler is currently in Beta, and is subject to changes and downtime.

The Autoscaler allows you to manage a group of instances, called an autogroup, to serve an application, and scale up or down according to customer traffic. The autoscaler automates instance creation / deletion, load tracking, performance tracking, and error handling.

To learn more about the general architecture of the Autoscaler, and how to integrate it with your application, see the autoscaler architecture docs page.

1) Set up your Vast account #

The first thing to do if you are new to Vast is to create an account and verify your email address. Then head to the Billing tab and add credits. Vast uses Stripe to process credit card payments and also accepts major cryptocurrencies through Coinbase; $20 should be enough to start. You can set up auto top-ups so that your credit card is charged when your balance is low.

2) Understand the Autoscaler metrics #

The autoscaler works in units of "load". For LLMs, load is measured in tokens / second. The total capacity of your autogroup is measured in the same units, and we start and stop instances by comparing the group's load capacity with your current and estimated load. The code that runs on each worker instance includes performance tests that estimate its token generation capability, and we then update these estimates with the actual performance of the servers in production. The code for the performance tests we are currently running can be found here. Depending on your use case, it might be helpful to run a more specialized performance test; if so, let us know and we can work with you to include your custom performance test. To learn more about the performance tests, see here.
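To make the notion of a load unit concrete, here is a minimal sketch (not Vast's actual benchmark code) of how a worker's capacity in tokens / second could be estimated by timing a generation call; `generate` and `dummy_generate` are hypothetical stand-ins for whatever inference call your backend exposes:

```python
import time

def measure_tokens_per_second(generate, prompt, n_tokens):
    """Estimate a worker's capacity in load units (tokens / second).

    `generate` is a placeholder for the backend's inference call; it is
    assumed to produce `n_tokens` tokens for the given prompt.
    """
    start = time.time()
    generate(prompt, n_tokens)
    elapsed = time.time() - start
    return n_tokens / elapsed

# Dummy backend that just sleeps, standing in for a real model call:
def dummy_generate(prompt, n_tokens):
    time.sleep(0.1)

capacity = measure_tokens_per_second(dummy_generate, "hello", 256)
```

The real performance tests linked above are more involved (they exercise the actual model under realistic batch sizes), but the reported number has this same shape: tokens produced divided by wall-clock time.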

3) Create an Autoscaling group (autogroup) #

To use the autoscaler, you must first create an autogroup through the GUI or the CLI.

These are the relevant fields when creating an autogroup:

"endpoint_name" : Each autogroup belongs to an endpoint group which has a unique name. Endpoint groups allow you to have multiple autogroups that each have different parameters (such as different "search_params" or "launch_args").

"search_params" : These are the arguments used in the search offers command that is run to find instances for the autoscaler to create.

"launch_args" : These are the arguments used in the create instance command that is run to create an instance. This is where you will specify information that is stored in a template from the GUI, such as the docker image to run, disk size, networking specifications, and the onstart script.

"min_load" : This is the baseline amount of load (tokens / second for LLMs) you want your autoscaling group to be able to handle.

"target_util" : The percentage of your autogroup compute resources that you want to be in-use at any given time. A lower value allows for more slack, which means your instance group will be less likely to be overwhelmed if there is a sudden spike in usage.

"cold_mult" : The multiple of your current load that is used to predict your future load, for example if you currently have 10 users, but expect there to be 20 in the near future, you can set cold_mult = 2.0. This should be set to 1.0 to begin.

"api_key: : The api key associated with the Vast account that manages this autogroup

To use the autoscaler with text-generation-inference as a backend, set the fields as follows. To learn how to use the autoscaler with other supported backends, see here.

"search_params" : This will depend on the model you want to run, but we have been testing with TheBloke/Llama-2-70B-chat-GPTQ and using the following parameters:

gpu_ram>=23 num_gpus=4 gpu_name=RTX_4090 inet_down>128 direct_port_count>3 disk_space>=192 driver_version>=535086005 rented=False
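The query is a space-separated list of field / comparison-operator / value triples. As a concrete illustration (this parser is not part of the Vast tooling; the real parsing happens server-side in the search offers command), the string above can be broken apart like so:

```python
import re

def parse_search_params(query):
    """Split a search_params query string into (field, op, value) triples.

    Illustrative only -- shows the structure of the query language, not
    how Vast actually evaluates it.
    """
    triples = []
    for token in query.split():
        m = re.match(r"(\w+)(>=|<=|!=|>|<|=)(.+)", token)
        if m:
            triples.append(m.groups())
    return triples

parse_search_params("gpu_ram>=23 num_gpus=4 rented=False")
# -> [('gpu_ram', '>=', '23'), ('num_gpus', '=', '4'), ('rented', '=', 'False')]
```

Each triple is a filter on offer attributes; an offer must satisfy every filter to be eligible for the autogroup.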

"launch_args" : The "launch_args" we use (except for HUGGING_FACE_HUB_TOKEN):

--onstart-cmd 'wget -O - | bash' --env '-e MODEL_ARGS="--model-id TheBloke/Llama-2-70B-chat-GPTQ --quantize gptq" -e HUGGING_FACE_HUB_TOKEN=YOUR_TOKEN_HERE -p 3000:3000' --image --disk 200 --ssh --direct

A number of different things are set here, so it is best to highlight the most important components:

--onstart-cmd 'wget -O - | bash'
    This script starts the vast-pyworker framework with the TGI backend, which allows you to integrate the text-generation-inference server with the autoscaling service.

--image
    This configures your instances to use the text-generation-inference image.

MODEL_ARGS="--model-id TheBloke/Llama-2-70B-chat-GPTQ --quantize gptq"
    These are the arguments given to the "text-generation-launcher" executable (in addition to others used to set it up in server mode) to run the model that will serve client requests.

-e HUGGING_FACE_HUB_TOKEN=YOUR_TOKEN_HERE
    This is necessary if you are using a gated model from Hugging Face.
-p 3000:3000
    This opens port 3000 to allow the vast-pyworker server to listen for external requests. 

Once you create an autogroup, it will be given an ID.

4) Use your autoscaling group #

Now that you have created an autoscaling group, instances should be created automatically, depending on your "min_load" and other parameter settings.

The autoscaler architecture page explains the general framework for using the autoscaler to serve client requests. At the moment, the user (as in the vast account holder) is responsible for acting as the proxy server as explained in the autoscaler architecture, but we are likely to offer our own proxy services in the near future.

The interface between the proxy server and the autoscaler server happens through the /route endpoint, and the interface is documented here.

Once the proxy server gets a worker address, it will need to call the appropriate endpoints on the worker, which will depend on your vast-pyworker backend. The endpoints for tgi are defined here, and more information about the vast-pyworker backends generally can be found here.

Here is an example of a proxy server calling the /route endpoint and then forwarding a client request to the returned worker address.

```python
import json

import requests

# AUTOSCALER_ROUTE_URL, ENDPOINT_NAME, and API_KEY are placeholders
# for your deployment's values.
def forward_client_request():
    # Call /route endpoint on autoscaler server
    route_payload = {
        "endpoint": ENDPOINT_NAME,
        "api_key": API_KEY,
        "cost": 256
    }
    response = requests.post(AUTOSCALER_ROUTE_URL,
                             headers={"Content-Type": "application/json"},
                             data=json.dumps(route_payload),
                             timeout=4)
    if response.status_code != 200:
        print(f"Failed to get worker address, response.status_code: {response.status_code}")
        return
    message = response.json()
    worker_address = message['url']

    # Call /generate endpoint on worker
    generate_payload = message
    generate_url = f"{worker_address}/generate"
    # the following fields would be sent from the client to the proxy server
    generate_payload["inputs"] = "What is the best movie of all time?"
    generate_payload["parameters"] = {"max_new_tokens": 256}
    generate_response = requests.post(generate_url,
                                      headers={"Content-Type": "application/json"},
                                      json=generate_payload)
    if generate_response.status_code != 200:
        print(f"Failed to call /generate endpoint for {generate_url}")
        return
    print(f"Response from {generate_url}:", generate_response.text)
```

5) Autoscaler best practices #

Different servers vary in performance, and also in compatibility with the software you are trying to run on them. We have automated systems that track when a server encounters an error and will automatically replace it, but if there are issues you want to diagnose, you can see our debugging guide here. Also, if you are having any issues with the service, please reach out to us at, or through the support page on our website.