
Hugging Face TGI with Llama3

This is a guide on how to set up and expose an API for Llama3 text generation.

For this guide, the model will be unquantized, using the 8B parameter version.

1) Select the Template #

Log in to your Vast account on the console.

Select the HuggingFace Llama3 TGI API template by clicking the link provided.

For this template we will be using the meta-llama/Meta-Llama-3-8B-Instruct model and TGI 2.0.4 from Hugging Face.

Templates encapsulate all the information required to run an application with the autoscaler, including machine parameters, docker image, and environment variables.

For this template, the only requirement is that you have your own Hugging Face access token. You will also need to apply for access to Llama3 on Hugging Face, since it is a gated repository.

The template comes with some filters that are minimum requirements for TGI to run effectively. These include, among others, a disk space requirement of 100GB and a GPU RAM requirement of at least 16GB.

After selecting the template your screen should look like this:

2) Modifying the Template #

This template will fail to run if you do not supply your Hugging Face access token, or if you have not been granted access to Meta's gated Llama3 repository on Hugging Face.

Once you have selected the template, you will then need to add your Hugging Face token and click the 'Select & Save' button.

You can add your Hugging Face token alongside the rest of the Docker run options.
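
For example, the token is usually passed as an environment variable in the Docker options field. The exact variable name is set by the template's pre-filled options, so treat the line below as a sketch with a placeholder token value; TGI reads the standard Hugging Face token variables (HUGGING_FACE_HUB_TOKEN, or HF_TOKEN in newer versions):

-e HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxx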

This is the only modification you will need to make on this template.

You can then press 'Select & Save' to get ready to launch your instance.

3) Rent a GPU #

Once you have selected the template, you can rent a GPU of your choice from either the search page or the CLI/API.

For someone just getting started, we recommend either an Nvidia RTX 4090 or an A5000.
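
If you prefer the CLI, you can search for offers that meet the template's requirements and then rent one of the returned offer IDs. This is only a sketch; the exact query fields and units can vary between CLI versions, so check vastai search offers --help for the current syntax:

vastai search offers 'gpu_name=RTX_4090 num_gpus=1 disk_space>=100'

Once you have picked an offer, you can rent it from the search page or create an instance from it via the CLI.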

4) Monitor Your Instance #

Once you rent a GPU, your instance will begin spinning up on the Instances page.

You know the API will be ready when your instance looks like this:

Once your instance is ready, you will need to find where your API is exposed. Open the IP & Port Config by pressing the blue button at the top of the instance card; the networking configuration is shown there.

After opening the IP & Port Config you should see a port forwarded from 5001; this is where your API resides. To hit TGI, you can use the '/generate' endpoint on that port.

Here is an example:
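
The sketch below uses Python's requests library; the IP address and port are placeholders for the public IP and the external port (mapped to 5001) shown in your IP & Port Config.

import requests

# Replace with the public IP and the external port forwarded to 5001 on your instance
INSTANCE_URL = "http://<INSTANCE_IP>:<FORWARDED_PORT>"

response = requests.post(
    f"{INSTANCE_URL}/generate",
    headers={"Content-Type": "application/json"},
    json={
        "inputs": "What is the best movie of all time?",
        "parameters": {"max_new_tokens": 256},
    },
)
print(response.text)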

5) Congratulations! #

You now have a running instance with an API that is using TGI loaded up with Llama3 8B!

Serverless/Autoscaler Guide #

As you use TGI, you may want to scale up to handle higher loads. We currently offer a serverless version of Hugging Face TGI via a template built to run with the Autoscaler.

The Autoscaler is built to run dynamic workloads that change over time, and it allows you to run multiple versions of the same model while allocating resources as effectively as possible.

1) Select the Autoscaler Template #

In order to use the Autoscaler version of TGI, you must use the Autoscaler version of the template, which can be found in the Autoscaler section of the Templates page.

When you find it and select it, it should look like this on the search page:

The setup is essentially the same: all you have to do is supply your Hugging Face access token, and then you will be good to go.

2) Create your Endpoint Group #

To use the API endpoint, you need to create an "Endpoint Group" (aka endptgroup) that manages your endpoint in response to incoming load.

You can do this through the CLI using this command:

vastai create endpoint --endpoint_name "my-endpoint" --cold_mult 1.0 --min_load 100 --target_util 0.9 --max_workers 20 --cold_workers 5
  • "min_load" : This is the baseline amount of load (tokens / second for LLMs) you want your autoscaling group to be able to handle. A good default is 100.0

  • "target_util" : The percentage of your autogroup compute resources that you want to be in-use at any given time. A lower value allows for more slack, which means your instance group will be less likely to be overwhelmed if there is a sudden spike in usage.

  • "cold_mult" : The multiple of your current load that is used to predict your future load, for example if you currently have 10 users, but expect there to be 20 in the near future, you can set cold_mult = 2.0. This should be set to 2.0 to begin.

  • "max_workers" : The maximum number of workers your endpoint group can have.

  • "cold_workers": The mimimum number of workers you want to keep "cold" (meaning stopped and fully loaded) when your group has no load. Note that this is only taken into account if you already have workers which are fully loaded but are no longer needed. A good way to ensure that you have enough workers that are loaded is setting the "test_workers" parameter of the autogroup correctly.

3) Create an Autoscaling Group #

Endpoint groups consist of one or more "Autoscaling Groups" (aka autogroups). For example, the following command adds an autoscaling group defined by the meta-llama/Meta-Llama-3-8B-Instruct template to your endpoint group.

vastai create autoscaler --endpoint_name "my-endpoint" --template_hash "XXXXXXXXXXXXXXXXXXXXXX" --test_workers 5
  • "template_hash" : Your template hash after adding your huggingface access key
  • "test_workers" : Min number of workers to create while initializing autogroup. This allows the autogroup to get performance estimates from machines running your configurations before deploying them to serve your endpoint. This will also allow you to create workers which are fully loaded and "stopped" (aka "cold") so that they can be started quickly when you introduce load to your endpoint.

Note: If you don't create an endpoint group explicitly before creating your autogroup, an endpoint group with the given name will be created in your account automatically; you will just need to make sure that the endpoint group parameters (--cold_mult, --min_load, --target_util) are set correctly.

Once you have an autogroup to define your machine configuration and an endpoint group to define a managed endpoint for your API, Vast's autoscaling server will automatically go to work finding offers and creating instances from them for your API endpoint. The instances the autoscaler creates will be accessible from your account and will have a tag corresponding to the name of your endpoint group.
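
You can check on these instances from the CLI as well; for example, listing the instances on your account will show any workers the autoscaler has created:

vastai show instances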

4) Send a request to your Endpoint Group #

Once your instance is loaded and running, you will be able to call the /route/ endpoint to get the address of your API endpoint on one of your worker servers. If you don't have any ready workers, the route endpoint will tell you the number of loading workers in the "status" field of the response.

Here is an example file that calls the https://run.vast.ai/route/ endpoint and then forwards a model request to the returned worker address.

You can copy this example.py and run it yourself with:

python example.py "my-endpoint" <YOUR API KEY>

import requests
import json
import argparse


def main(args):
    # Call /route endpoint on autoscaler server
    route_payload = {
        "endpoint": args.endpoint_name,
        "api_key": args.api_key,
        "cost": 256,
    }
    response = requests.post(
        "https://run.vast.ai/route/",
        headers={"Content-Type": "application/json"},
        data=json.dumps(route_payload),
        timeout=4,
    )
    if response.status_code != 200:
        print(
            f"Failed to get worker address, response.status_code: {response.status_code}"
        )
        return
    message = response.json()
    worker_address = message["url"]

    # Call /generate endpoint on worker
    generate_payload = message
    generate_url = f"{worker_address}/generate"
    # the following fields would be sent from the client to the proxy server
    generate_payload["inputs"] = "What is the best movie of all time?"
    generate_payload["parameters"] = {"max_new_tokens": 256}
    generate_response = requests.post(
        generate_url,
        headers={"Content-Type": "application/json"},
        json=generate_payload,
    )
    if generate_response.status_code != 200:
        print(f"Failed to call /generate endpoint for {generate_url}")
        return
    print(f"Response from {generate_url}:", generate_response.text)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Call an API at a specified rate")
    parser.add_argument("endpoint_name", type=str, help="Name of the endpoint to call")
    parser.add_argument("api_key", type=str, help="API key for the endpoint")
    args = parser.parse_args()
    main(args)

Please note that the backend endpoint on the worker instance will depend on what backend you are running, and more information about the endpoints can be found here

5) Monitor your Groups #

There is an endpoint on the autoscaler server that lets you access the logs corresponding to your endpoint group and autogroups, which is described here

There is also an endpoint that allows you to see metrics for your groups, which is described here