
Autoscaler Guide

Please note the Autoscaler is currently in Beta, and is subject to changes and downtime.

Some popular templates on Vast, such as Oobabooga LLM WebUI and AUTOMATIC1111 SD WebUI, can be run in API mode to act as inference servers backing an application. Vast's autoscaling service lets users set up such instances automatically, and it handles instance creation and deletion, performance tracking, and error handling. The autoscaler also provides authentication services to ensure that requests reaching your Vast instances come only from approved clients.

Note that this guide assumes knowledge of the Vast CLI; an introduction to it can be found here.

1) Choose your autoscaler-compatible template #

Templates encapsulate all the information required to run an application with the autoscaler, including machine parameters, docker image, and environment variables. For each of our popular templates, we have created a few autoscaler-compatible templates that serve specific models in API mode on hardware best suited to each model. The templates we offer that are pre-configured to work with the autoscaler can be found here. You can create an autogroup from one of these templates by specifying its template_hash on autogroup creation.

2) Create your Endpoint Group #

To use an API endpoint, you need to create an "Endpoint Group" (aka endptgroup) that manages your endpoint in response to incoming load.

You can do this through the CLI using this command:

vastai create endpoint --endpoint_name "my-endpoint" --cold_mult 1.0 --min_load 0.0 --target_util 0.9 --max_workers 20 --cold_workers 5
  • "min_load" : This is the baseline amount of load (tokens / second for LLMs) you want your autoscaling group to be able to handle. A good default is 100.0

  • "target_util" : The percentage of your autogroup compute resources that you want to be in-use at any given time. A lower value allows for more slack, which means your instance group will be less likely to be overwhelmed if there is a sudden spike in usage.

  • "cold_mult" : The multiple of your current load that is used to predict your future load, for example if you currently have 10 users, but expect there to be 20 in the near future, you can set cold_mult = 2.0. This should be set to 2.0 to begin.

  • "max_workers" : The maximum number of workers your endpoint group can have.

  • "cold_workers": The mimimum number of workers you want to keep "cold" (meaning stopped and fully loaded) when your group has no load. Note that this is only taken into account if you already have workers which are fully loaded but are no longer needed. A good way to ensure that you have enough workers that are loaded is setting the "test_workers" parameter of the autogroup correctly.

3) Create an Autoscaling Group #

Endpoint groups consist of one or more "Autoscaling Groups" (aka autogroups). Autogroups describe the machine configurations and parameters that will serve the requests, and they can be fully defined by a template. For example, the following command adds an autoscaling group defined by the tgi-llama2-7B-quantized template to your endpoint group.

vastai create autoscaler --endpoint_name "my-endpoint" --template_hash "3f19d605a70f4896e8a717dfe6b517a2" --test_workers 5
  • "test_workers" : Min number of workers to create while initializing autogroup. This allows the autogroup to get performance estimates from machines running your configurations before deploying them to serve your endpoint. This will also allow you to create workers which are fully loaded and "stopped" (aka "cold") so that they can be started quickly when you introduce load to your endpoint.

Note that if you don't create an endpoint group explicitly before creating your autogroup, an endpoint group with your given name will be created in your account; you will just need to make sure that its parameters (--cold_mult, --min_load, --target_util) are set correctly.

Once you have an autogroup to define your machine configuration and an endpoint group to define a managed endpoint for your API, Vast's autoscaling server will go to work automatically finding offers and creating instances from them for your API endpoint. The instances the autoscaler creates will be accessible from your account and will have a tag corresponding to the name of your endpoint group.
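
If you prefer to script this setup, the following sketch simply wraps the CLI commands from this guide with Python's subprocess module and then lists your instances to check on the created workers. It assumes the vastai CLI is installed and configured with your API key; the endpoint name and template hash are the example values used above.

import subprocess

# Sketch only: scripting the setup steps above with the Vast CLI via subprocess.
# Assumes the vastai CLI is installed and authenticated (e.g. vastai set api-key ...).
ENDPOINT_NAME = "my-endpoint"                       # example name used in this guide
TEMPLATE_HASH = "3f19d605a70f4896e8a717dfe6b517a2"  # example template hash from above

# Create the endpoint group that manages the API endpoint.
subprocess.run([
    "vastai", "create", "endpoint",
    "--endpoint_name", ENDPOINT_NAME,
    "--cold_mult", "1.0", "--min_load", "0.0",
    "--target_util", "0.9", "--max_workers", "20", "--cold_workers", "5",
], check=True)

# Add an autogroup to that endpoint group, defined by a template.
subprocess.run([
    "vastai", "create", "autoscaler",
    "--endpoint_name", ENDPOINT_NAME,
    "--template_hash", TEMPLATE_HASH,
    "--test_workers", "5",
], check=True)

# Instances created by the autoscaler appear in your account tagged with the
# endpoint group name, so you can check on them with:
subprocess.run(["vastai", "show", "instances"], check=True)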

4) Send a request to your Endpoint Group #

It might take a few minutes for your first instances to be created and for the model to be downloaded onto them. Once you have an instance with the model fully loaded, you will be able to call the /route/ endpoint to get the address of your API endpoint on one of your worker servers. If you don't have any ready workers, the route endpoint will tell you the number of loading workers in the "status" field of the returned JSON.
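
Because workers may still be loading when you first send traffic, one simple approach is to poll /route/ until it returns a worker address. The sketch below is based on the description above: it assumes a ready response includes a "url" field and a not-ready response reports loading workers in a "status" field.

import json
import time
import requests

def wait_for_worker(endpoint_name, api_key, timeout_s=600.0):
    """Poll /route/ until a worker address is returned.

    Assumes (per the description above) that a ready response contains a
    'url' field and that a not-ready response describes loading workers in
    a 'status' field; adjust if your responses differ.
    """
    payload = {"endpoint": endpoint_name, "api_key": api_key, "cost": 256}
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.post("https://run.vast.ai/route/",
                             headers={"Content-Type": "application/json"},
                             data=json.dumps(payload), timeout=4)
        resp.raise_for_status()
        message = resp.json()
        if message.get("url"):
            return message                     # worker ready; includes its address
        print("No ready workers yet, status:", message.get("status"))
        time.sleep(10)
    raise TimeoutError("No worker became ready before the timeout")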

Here is an example of calling the https://run.vast.ai/route/ endpoint, and then forwarding a model request to the returned worker address.

import json
import requests

ENDPOINT_NAME = "my-endpoint"   # name of your endpoint group
API_KEY = "YOUR_VAST_API_KEY"   # your Vast API key

def query_endpoint():
    # Call /route/ endpoint on the autoscaler server to get a worker address
    route_payload = {
        "endpoint": ENDPOINT_NAME,
        "api_key": API_KEY,
        "cost": 256
    }
    response = requests.post("https://run.vast.ai/route/",
                             headers={"Content-Type": "application/json"},
                             data=json.dumps(route_payload),
                             timeout=4)
    if response.status_code != 200:
        print(f"Failed to get worker address, response.status_code: {response.status_code}")
        return
    message = response.json()
    worker_address = message['url']

    # Call /generate endpoint on the worker
    generate_payload = message
    generate_url = f"{worker_address}/generate"
    # the following fields would be sent from the client to the proxy server
    generate_payload["inputs"] = "What is the best movie of all time?"
    generate_payload["parameters"] = {"max_new_tokens": 256}
    generate_response = requests.post(generate_url,
                                      headers={"Content-Type": "application/json"},
                                      json=generate_payload)
    if generate_response.status_code != 200:
        print(f"Failed to call /generate endpoint for {generate_url}")
        return
    print(f"Response from {generate_url}:", generate_response.text)

query_endpoint()
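
Note that the example copies the full /route/ response into generate_payload rather than building a fresh dictionary, so any additional fields returned by the autoscaler (for example, authentication-related fields used to verify that requests come from approved clients, as described in the introduction) are forwarded to the worker unchanged along with your "inputs" and "parameters".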

Please note that the backend endpoint on the worker instance will depend on which backend you are running, and more information about these endpoints can be found here.

5) Monitor your Groups #

There is an endpoint on the autoscaler server that lets you access the logs corresponding to your endpoint group and autogroups, which is described here.

There is also an endpoint that allows you to see metrics for your groups, which is described here.