This notebook shows how to serve a large language model on Vast's GPU platform using the popular open-source inference framework [vLLM](https://github.com/vllm-project/vllm). `vLLM` is particularly good at high-throughput serving for multi-user or high-load use cases, and is one of the most popular serving frameworks today.

The commands in this notebook can be run here, or copied and pasted into your terminal (minus the `!` or the `%%bash`). At the end, we show how to query your `vLLM` service either from Python or with a curl request from the terminal.

In [None]:
%%bash
#In an environment of your choice
pip install --upgrade vastai

In [None]:
%%bash
# Here we will set our api key
vastai set api-key <Your-API-Key-Here>


Now we are going to look for GPUs on Vast. The model that we are using is very small, but to make it easy to swap in the model you want, we will select machines that:
1. Have GPUs with Ampere or newer architecture
2. Have at least 24 GB of GPU RAM (to run 13B parameter LLMs)
3. Have exactly one GPU, as `vLLM` primarily serves one copy of a model
4. Have a static IP address to route requests to
5. Have direct port counts available (greater than 1) to enable port reservations
6. Use CUDA 12.1 or higher due to `vLLM`'s base image
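
The criteria above map one-to-one onto the filter string used in the search cell that follows. As a minimal sketch, the string can be assembled from parameters so that it is easy to adjust later. The helper name `build_offer_query` and its parameters are hypothetical and not part of the `vastai` CLI; only the filter fields themselves come from this notebook.

```python
import shlex


def build_offer_query(min_compute_cap=800, min_gpu_ram_gb=24, num_gpus=1, min_cuda="12.1"):
    """Assemble the filter string passed to `vastai search offers`.

    Hypothetical helper: it only concatenates the filters used in this notebook.
    """
    filters = [
        f"compute_cap >= {min_compute_cap}",  # 800 corresponds to Ampere or newer
        f"gpu_ram >= {min_gpu_ram_gb}",       # GPU RAM threshold, as in the search cell
        f"num_gpus = {num_gpus}",             # exact GPU count
        "static_ip=true",                     # stable address to route requests to
        "direct_port_count > 1",              # room for port reservations
        f"cuda_vers >= {min_cuda}",           # required by the vLLM base image
    ]
    return " ".join(filters)


query = build_offer_query()
# shlex.join quotes the filter string so it can be pasted into a shell.
print(shlex.join(["vastai", "search", "offers", query]))
```

Changing `num_gpus` or `min_gpu_ram_gb` is then enough to target a larger model.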

In [None]:
%%bash
vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.1' 


Copy and paste the ID of the machine that you would like to use into `<instance-id>` below.
We will create this instance with the `vLLM-OpenAI` template. This template gives us a Docker image that runs `vLLM` behind an OpenAI-compatible server, which means it can slot into any application that uses the OpenAI API. All you need to change in your app is the `base_url` and the `model_id`, set to the model that you are using, so that requests are routed to your model.

This command also exposes port 8000 in the Docker container, the default OpenAI server port, and tells the container to automatically download and serve the `stabilityai/stablelm-2-zephyr-1_6b` model. You can change the model by using any HuggingFace model ID. We chose this one because it is fast to download and start playing with.

We use Vast's `--args` option to pass the rest of the command to the container, in this case `--model stabilityai/stablelm-2-zephyr-1_6b`, which `vLLM` uses to download the model.

In [None]:
%%bash

vastai create instance <instance-id> --image vllm/vllm-openai:latest --env '-p 8000:8000' --disk 40 --args --model stabilityai/stablelm-2-zephyr-1_6b
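
For scripting, the same command can be built as an argument list, which keeps everything after `--args` clearly separated as the part that is forwarded to the container. This is only a sketch: `build_create_command` is a hypothetical helper, and it uses nothing beyond the flags already shown in the cell above.

```python
import shlex


def build_create_command(offer_id, model, disk_gb=40, extra_vllm_args=()):
    """Return the `vastai create instance` command as a list of arguments.

    Hypothetical helper mirroring the command used in this notebook.
    """
    return [
        "vastai", "create", "instance", str(offer_id),
        "--image", "vllm/vllm-openai:latest",
        "--env", "-p 8000:8000",   # expose the OpenAI-compatible server port
        "--disk", str(disk_gb),    # disk size in GB
        "--args",                  # everything after this goes to the container
        "--model", model,
        *extra_vllm_args,
    ]


command = build_create_command("<instance-id>", "stabilityai/stablelm-2-zephyr-1_6b")
# Print a shell-safe version for inspection before running it.
print(shlex.join(command))
```

The list can be handed to `subprocess.run(command, check=True)` once `<instance-id>` has been replaced with a real ID.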

Now, we need to verify that our setup is working. We first need to wait for our machine to download the image and the model and start serving. This will take a few minutes. The logs will show you when it's done. 

Then, at the top of the instance, there is a button with an IP address in it. Click it and a panel will show the IP address and the forwarded ports.
Copy the IP address and the external port that is forwarded to port 8000 into the curl command below.
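
Since the image and the model take a few minutes to download, it can be convenient to poll the server instead of watching the logs. The sketch below assumes the server answers `GET /v1/models` (part of the OpenAI-compatible API that `vLLM` serves) once the model is loaded. The function names, timeouts, and the injectable `fetch` and `sleep` arguments are choices made for this example, not part of `vLLM` or `vastai`.

```python
import json
import time
import urllib.error
import urllib.request


def list_models(base_url, timeout_s=5.0):
    """Return the model IDs reported by GET {base_url}/v1/models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout_s) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]


def wait_until_ready(base_url, max_wait_s=600.0, poll_s=10.0,
                     fetch=list_models, sleep=time.sleep):
    """Poll until the server lists at least one model, or raise TimeoutError."""
    deadline = time.monotonic() + max_wait_s
    while True:
        try:
            models = fetch(base_url)
            if models:
                return models
        except (urllib.error.URLError, OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{base_url} did not become ready in {max_wait_s} s")
        sleep(poll_s)


# Example (fill in the placeholders first):
# print(wait_until_ready("http://<Instance-IP-Address>:<Port>"))
```

The returned list should contain the model ID passed to `--model`, which is the value to use in the `"model"` field of later requests.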

The curl command below sends an OpenAI-compatible request to your vLLM server. You should see the response if everything is set up correctly.

In [None]:
%%bash
# This request assumes you haven't changed the model. If you did, update the "model" value in the JSON payload below
curl -X POST http://<Instance-IP-Address>:<Port>/v1/completions -H "Content-Type: application/json"  -d '{"model" : "stabilityai/stablelm-2-zephyr-1_6b", "prompt": "Hello, how are you?", "max_tokens": 50}'


This next cell replicates exactly the same request using the Python `requests` library. If you're looking to build on this further, we recommend checking out the [OpenAI SDK](https://github.com/openai/openai-python).


In [None]:
import requests

headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'model': 'stabilityai/stablelm-2-zephyr-1_6b',
    'prompt': 'Hello, how are you?',
    'max_tokens': 50,
}

response = requests.post('http://<Instance-IP-Address>:<Port>/v1/completions', headers=headers, json=json_data)
print(response.content)
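
As a rough sketch of what the OpenAI SDK route could look like, the function below points the client at the vLLM server by overriding `base_url`. It assumes the `openai` Python package (version 1 or later) is installed and that the server was started without an API key, as in this notebook, so the `api_key` value is only a placeholder. The helper names `build_base_url` and `complete_with_sdk` are made up for this example.

```python
def build_base_url(ip_address, port):
    """Return the OpenAI-style base URL for a vLLM server at ip_address:port."""
    return f"http://{ip_address}:{port}/v1"


def complete_with_sdk(ip_address, port, model, prompt, max_tokens=50):
    """Send a completion request through the OpenAI SDK and return the text."""
    # Imported lazily so build_base_url works without the SDK installed.
    from openai import OpenAI

    # The client requires some api_key value; "EMPTY" is a placeholder that
    # assumes the server does not check keys.
    client = OpenAI(base_url=build_base_url(ip_address, port), api_key="EMPTY")
    result = client.completions.create(
        model=model,
        prompt=prompt,
        max_tokens=max_tokens,
    )
    return result.choices[0].text


# Example (fill in the placeholders first):
# print(complete_with_sdk("<Instance-IP-Address>", "<Port>",
#                         "stabilityai/stablelm-2-zephyr-1_6b",
#                         "Hello, how are you?"))
```

Only `base_url` and `model` differ from a request sent to OpenAI itself, which is what makes the server a drop-in replacement.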

# Advanced vLLM Usage: Quantized Llama-3-70b-Instruct
Now that we've spun up a model on vLLM, we can get into more complicated deployments. We'll serve this quantized Llama-3 70B [model](https://huggingface.co/casperhansen/llama-3-70b-instruct-awq).
Because it is quantized, we can easily serve this model on four RTX 4090 GPUs.

Overall, a few things need to change:
1. The model string needs to change to our new model.
2. We're going to use 4 GPUs.
3. We need to provision much more disk space on our system to be able to download the full set of weights. 100 GB should be fine in this case.
4. We need to set up tensor parallelism inside vLLM to split the model across these 4 GPUs.
5. We need to let vLLM know that this is a quantized model.
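
A back-of-the-envelope calculation shows why quantization and four GPUs go together here. The numbers below are assumptions for illustration: a nominal 70 billion parameters, 4-bit AWQ weights, and 24 GB of memory per RTX 4090. The estimate covers the weights only and ignores quantization scales, the KV cache, activations, and CUDA overhead, so real usage will be higher.

```python
def weight_size_gb(num_params, bits_per_param):
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70e9      # nominal parameter count (assumption)
NUM_GPUS = 4
GPU_MEMORY_GB = 24     # one RTX 4090

fp16_gb = weight_size_gb(NUM_PARAMS, 16)  # unquantized 16-bit weights
awq_gb = weight_size_gb(NUM_PARAMS, 4)    # assuming 4-bit AWQ weights
total_vram_gb = NUM_GPUS * GPU_MEMORY_GB
per_gpu_gb = awq_gb / NUM_GPUS            # even split under tensor parallelism

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {awq_gb:.1f} GB ({per_gpu_gb:.2f} GB per GPU)")
print(f"Total GPU RAM:  {total_vram_gb} GB")
print(f"16-bit fits: {fp16_gb < total_vram_gb}, 4-bit fits: {awq_gb < total_vram_gb}")
```

Under these assumptions the 16-bit weights (about 140 GB) exceed the 96 GB available across four cards, while the 4-bit weights (about 35 GB) leave room for the KV cache on each GPU.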



In [None]:
%%bash
vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 4 static_ip=true direct_port_count > 1 cuda_vers >= 12.1' 


The search above is similar to the earlier one, but requires machines with 4 GPUs.

In our instance creation, we will increase the disk size to 100 GB.

Then, we will tell vLLM to: 1. use the specific model (`--model`), 2. split it across 4 GPUs (`--tensor-parallel-size 4`), and 3. treat it as a quantized model (`--quantization awq`).

In [None]:
%%bash

vastai create instance <Instance-ID> --image vllm/vllm-openai:latest --env '-p 8000:8000' --disk 100 --args --model casperhansen/llama-3-70b-instruct-awq --tensor-parallel-size 4  --quantization awq 

In [None]:
%%bash
# This request assumes you haven't changed the model. If you did, update the "model" value in the JSON payload below
curl -X POST http://<Instance-IP-Address>:<Port>/v1/completions -H "Content-Type: application/json"  -d '{"model" : "casperhansen/llama-3-70b-instruct-awq", "prompt": "Hello, how are you?", "max_tokens": 50}'


In [None]:
import requests

headers = {
    'Content-Type': 'application/json',
}

json_data = {
    'model': 'casperhansen/llama-3-70b-instruct-awq',
    'prompt': 'Hello, how are you?',
    'max_tokens': 50,
}

response = requests.post('http://<Instance-IP-Address>:<Port>/v1/completions', headers=headers, json=json_data)
print(response.content)
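
The generated text sits inside the JSON response body, under `choices`. As a small sketch, the helper below pulls it out of the raw body (for example `response.content`). The function name is hypothetical, and the sample payload is hand-written to illustrate the OpenAI completions response shape; it is not real output from the server.

```python
import json


def extract_completion_text(raw_body):
    """Return the text of the first choice in a /v1/completions response body."""
    payload = json.loads(raw_body)
    choices = payload.get("choices") or []
    if not choices:
        # Error responses carry no choices; surface the body for debugging.
        raise ValueError(f"no completion in response: {payload}")
    return choices[0]["text"]


# Hand-written illustrative payload, not actual model output.
sample_body = json.dumps({
    "object": "text_completion",
    "model": "casperhansen/llama-3-70b-instruct-awq",
    "choices": [{"index": 0, "text": " I am doing well.", "finish_reason": "length"}],
}).encode("utf-8")

print(extract_completion_text(sample_body))
```

With a live server, `extract_completion_text(response.content)` would return only the generated text rather than the full JSON.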