By now, just about everyone has either heard of large language models (LLMs) or has used them before. Not as many people are familiar with small language models (SLMs), however.

The names offer a clue – large and small – but size isn't the only thing that sets them apart.

So what are the differences between LLMs and SLMs? When do you actually need a large model, and when might a small one be the better choice?

In this post, we'll answer these questions and more. Here's what you need to know.

### Large Language Models vs. Small Language Models

An AI language model is a type of artificial intelligence that is trained on huge text datasets to **understand and generate human language.** These models use probabilistic machine learning to predict which words are most likely to appear next in a phrase sequence – and, in doing so, are able to generate text that resembles how people actually write and speak.
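To make "predicting the next word" concrete, here is a deliberately tiny sketch. It is a bigram counter, not a neural network, and the corpus and function names are made up for illustration; real language models learn far richer statistics, but the underlying idea of turning observed text into next-word probabilities is the same.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn raw follow-up counts into a probability distribution."""
    followers = counts[prev]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

# Toy corpus, purely illustrative
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs = next_word_probs(model, "the")

# Pick the most likely word to follow "the"
print(max(probs, key=probs.get))
```

In this toy corpus, "the" is followed by "cat" twice and "mat" once, so "cat" gets the higher probability.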

In practice, this ability is what underpins natural language processing (NLP) systems that let machines interact with us in useful ways.

When it comes to LLMs and SLMs specifically, though, let's look at how the two compare.

#### What Are LLMs?

With **massive parameter counts** in the billions and even trillions, most modern LLMs rely on the transformer architecture, which uses a **self-attention mechanism** to make sense of relationships between words across a sequence. This allows them to model incredibly complex patterns and long-range dependencies in language.
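As a rough illustration of what self-attention computes, the sketch below implements scaled dot-product attention in plain Python. It is a simplification under stated assumptions: the queries, keys, and values are the input vectors themselves (real transformers apply learned projections and use many attention heads), and the three toy "token" vectors are invented for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with Q = K = V = input (no learned weights)."""
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Similarity of this token to every token in the sequence
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted mix of all token vectors
        outputs.append([sum(wi * v[j] for wi, v in zip(w, vectors)) for j in range(d)])
    return outputs, weights

# Three toy token embeddings (illustrative values)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(tokens)
for row in attn:
    print([round(w, 3) for w in row])
```

Each row of `attn` shows how strongly one token attends to every other token, which is how the model relates words regardless of how far apart they sit in the sequence.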

Since they're **trained on extensive datasets over a variety of domains**, LLMs are highly versatile. They have broad general knowledge and can generalize well across different tasks – even multitasking effectively across domains.

The downside is that training and running LLMs requires **significant computational power.** It's a resource-intensive process that often involves costly hardware and large-scale distributed computing infrastructure.

#### What Are SLMs?

At the basic level, an SLM is a smaller version of an LLM. It has **fewer parameters** – in the millions to low billions – and its architecture is focused on efficiency.

For instance, some SLMs use a **sliding window attention** mechanism (where the model focuses on a fixed-length "window" that slides across the text), and others rely on techniques like **grouped-query attention** (GQA) and **sparse attention patterns** to lower compute costs. Fine-tuning methods such as **low-rank adaptation** (LoRA) further reduce the cost of adapting a model to a new task.
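The sliding-window idea can be sketched as an attention mask. The example below assumes a causal variant in which each token may look at itself and the few tokens immediately before it; the sequence length and window size are arbitrary illustrative values, and actual models differ in the details.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i may attend to token j.

    Assumes causal attention: token i sees itself and up to
    `window - 1` tokens immediately before it.
    """
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]

mask = sliding_window_mask(seq_len=6, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))

# Number of attended (query, key) pairs with and without the window
windowed_pairs = sum(sum(row) for row in mask)
full_causal_pairs = 6 * (6 + 1) // 2
print(windowed_pairs, full_causal_pairs)
```

With full causal attention the number of pairs grows quadratically with sequence length, while the windowed version grows roughly linearly, which is where the savings come from.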

Unlike LLMs, SLMs are typically **trained on domain-specific data** or smaller datasets tailored to particular tasks. Although they may lack broad general knowledge, they do well in their specific domains and can even be fine-tuned for niche or regulated industries like finance and healthcare.

Because of their smaller size, SLMs need **less computational power** and can often be trained and deployed on more modest hardware.

### LLMs vs. SLMs: Strengths and Limitations

Both LLMs and SLMs have advantages and drawbacks. Here's a quick side-by-side comparison of their respective pros and cons:

<table class="w-full border-collapse border border-gray-300">
    <thead>
        <tr class="bg-gray-100">
            <th class="border border-gray-300 px-4 py-2 text-left font-semibold">
                Model Type
            </th>
            <th class="border border-gray-300 px-4 py-2 text-left font-semibold">
                Strengths
            </th>
            <th class="border border-gray-300 px-4 py-2 text-left font-semibold">
                Limitations
            </th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td class="border border-gray-300 px-4 py-2 font-semibold">LLM</td>
            <td class="border border-gray-300 px-4 py-2">
                Broad general knowledge across different domains
                <br />
                Can handle complex, open-ended tasks with more contextual
                understanding
                <br />
                Greater multimodal potential
            </td>
            <td class="border border-gray-300 px-4 py-2">
                Requires extensive compute resources and costly hardware
                <br />
                Slower inference and higher latency
                <br />
                Not optimized for specific tasks, and higher risk of bias due to
                unfiltered training data
            </td>
        </tr>
        <tr class="bg-gray-50">
            <td class="border border-gray-300 px-4 py-2 font-semibold">SLM</td>
            <td class="border border-gray-300 px-4 py-2">
                Easier to fine-tune for specific domains
                <br />
                Uses fewer compute resources
                <br />
                Faster inference and lower latency
                <br />
                Stronger data control and easier on-prem deployment
            </td>
            <td class="border border-gray-300 px-4 py-2">
                Narrower focus and less general knowledge; may require
                retraining for new tasks
                <br />
                Struggles with highly complex or long-context tasks
                <br />
                Limited multitasking capability
            </td>
        </tr>
    </tbody>
</table>

So how do you decide which type of model is right for your needs?

### Choosing the Right Model

As we've covered, neither model type is universally superior. The value of each one depends on what you're trying to accomplish and the resources you have available. That said, here are some guidelines to keep in mind.

#### An LLM may be the best choice if:

-   You need a model for **open-ended, multi-domain applications** such as general-purpose chatbots or creative content generation.  

-   Your use case involves **long-range context** across different subject areas.  

-   You have access to **powerful compute resources** and can manage the costs involved.

#### An SLM may be right for you if:

-   You prefer a model that is **lightweight and efficient**, suited for **resource-constrained environments.**  

-   Your application is **narrow or domain-specific**, such as FAQ bots or translation from one language to another.  

-   You want to **fine-tune with proprietary data** or maintain stricter control in regulated industries.

At the same time, you don't necessarily even have to choose one or the other; a hybrid approach could be the way to go. An SLM can take on routine tasks while an LLM handles more nuanced issues. For instance, you could use an SLM for standard customer queries and escalate more complex problems to an LLM.
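A hybrid setup like this can be sketched as a simple router. Everything here is hypothetical: the escalation rule (query length plus a few keywords), the thresholds, and the `fake_slm` / `fake_llm` stand-ins are placeholders for whatever models and criteria you would actually use.

```python
def needs_escalation(query, max_words=25, keywords=("dispute", "legal", "outage")):
    """Hypothetical heuristic: long queries or sensitive keywords go to the LLM."""
    words = [w.strip(".,!?") for w in query.lower().split()]
    return len(words) > max_words or any(k in words for k in keywords)

def answer(query, slm, llm):
    """Send routine queries to the small model and escalate the rest."""
    model = llm if needs_escalation(query) else slm
    return model(query)

# Stand-in callables; in practice these would call real model endpoints
def fake_slm(query):
    return "SLM: " + query

def fake_llm(query):
    return "LLM: " + query

print(answer("What are your opening hours?", fake_slm, fake_llm))
print(answer("I want to raise a legal dispute about my invoice.", fake_slm, fake_llm))
```

In production, the routing rule is often a classifier or the SLM's own confidence signal rather than keywords, but the control flow stays the same.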

### Final Thoughts

Both LLMs and SLMs offer unique advantages. Getting the most out of them often depends on having scalable, affordable compute – and that's where **Vast.ai** comes in.

Our cloud GPU platform gives you the flexibility to run a lightweight SLM on a single GPU or to train a massive LLM across [distributed clusters](https://vast.ai/products/clusters), all at a fraction of the typical cost. With Vast.ai, you can **save up to 5–6X** compared to traditional cloud providers – and train, fine-tune, and deploy AI models on your own terms.

Ready to get started? Spin up GPUs on demand with [Vast.ai](http://Vast.ai) today!


Learn the key differences between Large Language Models (LLMs) and Small Language Models (SLMs), their strengths, limitations, and when to use each.

LLMs vs. SLMs: What's the Difference, and Why Does It Matter?


# Breaking New Ground with Llama 3.1

This week, Meta [launched](https://ai.meta.com/blog/meta-llama-3-1/) the Llama 3.1 collection of large language models (LLMs). It consists of three new models – pre-trained and instruction-tuned text in/text out open-source generative AI models – with parameter counts of 8B, 70B, and 405B.

According to Meta, the flagship 405B version is "the world's largest and most capable openly available foundation model."

## Open-Source Approach and Innovation

CEO Mark Zuckerberg champions the open-source approach, [predicting](https://www.facebook.com/4/posts/10115716861061241/) that it will eventually become the industry standard, much like Linux did for operating systems. He asserts that open-source AI models not only develop more rapidly but also offer greater innovation potential compared to their proprietary, closed-source counterparts.

The global AI community has indeed been energized by the release of Llama 3.1, with plenty of discussions and exploration around its potential. Here's what you need to know!

## Earlier Goals and Recent Achievements

Earlier this year, when the first, smaller Llama 3 models were released (succeeding [Llama 2](https://vast.ai/article/running-the-70B-LLama2-GPTQ)), Meta [stated](https://ai.meta.com/blog/meta-llama-3/) that its goal in the near future was "to make Llama 3 multilingual and multimodal, have longer context, and continue to improve overall performance across LLM capabilities such as reasoning and coding."

With Llama 3.1, it's made great strides toward achieving that goal. The LLM isn't multimodal yet, but it does boast new multilingual capabilities (in Spanish, Portuguese, Italian, German, and Thai), as well as expanded tool use and drastically increased context length. Trained using over 16,000 of NVIDIA's H100 GPUs on a massive dataset of 15 trillion tokens, the 405B model is significantly more complex and powerful than its predecessors.

## Performance Benchmarks

Meta [says](https://ai.meta.com/blog/meta-llama-3-1/) that Llama 3.1 405B outperforms OpenAI's GPT-4 and GPT-4o as well as Anthropic's Claude 3.5 Sonnet on a number of benchmark tests. And across a range of different tasks, it's reportedly "competitive with" its closed-source rivals.

Here's how the 405B model compares to other cutting-edge LLMs across commonly used benchmarks (with Gemini not included because Meta had [difficulty](https://www.theverge.com/2024/7/23/24204055/meta-ai-llama-3-1-open-source-assistant-openai-chatgpt) using Google's APIs to replicate its results):

![table](/uploads/llama3.1.png)

## Model Architecture and Design

In a [blog post](https://ai.meta.com/blog/meta-llama-3-1/) introducing Llama 3.1, Meta specified that the model's full training stack was "significantly optimized." Design choices prioritized scalability and simplicity for the model development process.

For instance, to maximize training stability, Llama 3.1 uses a standard decoder-only transformer model architecture with minor adaptations instead of a mixture-of-experts model. Meta also adopted an iterative post-training procedure, using supervised fine-tuning and direct preference optimization for each round. The result was the creation of superior-quality synthetic data with each iteration, enhancing the performance of every capability.

The 405B model itself was even used to improve the post-training quality of the smaller 70B and 8B models.

Notably, in order to facilitate large-scale production inference for a model at the 405B's scale, Meta transitioned from 16-bit (BF16) to 8-bit (FP8) numerics. This effectively reduces the compute requirements and enables the model to run within a single server node.
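A rough back-of-the-envelope calculation shows why halving the bytes per weight matters at this scale. The sketch below counts weight memory only (no KV cache or activations) and assumes a node with eight 80 GB GPUs; that node shape is our assumption for illustration, not something Meta's post specifies.

```python
PARAMS = 405e9                      # 405B parameters
BYTES_PER_PARAM = {"BF16": 2, "FP8": 1}

# Assumed server node: 8 GPUs with 80 GB each (illustrative)
NODE_GPUS = 8
GPU_MEMORY_GB = 80

def weight_memory_gb(params, bytes_per_param):
    """Memory needed for the weights alone, in gigabytes."""
    return params * bytes_per_param / 1e9

node_gb = NODE_GPUS * GPU_MEMORY_GB
for name, nbytes in BYTES_PER_PARAM.items():
    gb = weight_memory_gb(PARAMS, nbytes)
    print(f"{name}: {gb:.0f} GB of weights, fits in a {node_gb} GB node: {gb < node_gb}")
```

Under these assumptions the 16-bit weights alone exceed the node's total GPU memory, while the 8-bit weights leave room for the KV cache and activations.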

Users can now enjoy a longer context window, as well. The Llama 3.1 models have a context length that's been expanded from 8,192 tokens in Llama 3 to 128,000 tokens in Llama 3.1. That's about 16 times as much!

In fact, the expanded context length is now much greater than that of GPT-4 and about equal to what enterprise users get with GPT-4o – and pretty comparable to the 200,000 token window of Claude 3.

On top of that, periods of high demand won't affect access because Llama 3.1 can be deployed on your own hardware or chosen cloud provider. Generally, there won't be broad usage limits, either.

## Using and Building with Llama 3.1 405B

As such a powerful model, the 405B requires significant compute resources and developer expertise to work with. Meta explicitly states that it wants users to get the most out of it – to take advantage of its advanced capabilities and start building immediately. The following are some possibilities:

- Real-time and batch inference
- Supervised fine-tuning, including on a specific domain
- LLM-as-a-judge (evaluation of your model for your specific application)
- Continual pre-training
- Retrieval-Augmented Generation (RAG)
- Function calling
- Synthetic data generation

Ahmad Al-Dahle, Meta's VP of generative AI, [predicts](https://www.theverge.com/2024/7/23/24204055/meta-ai-llama-3-1-open-source-assistant-openai-chatgpt) that knowledge distillation will be a popular use of the 405B model for developers. That is, it can be used as a larger "teacher" model that distills its knowledge and emergent abilities into a smaller "student" model with faster and more cost-effective inference.
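One common way to express the teacher-student idea is a loss that pushes the student's output distribution toward the teacher's. The sketch below uses a temperature-softened KL divergence; this is a widely used formulation, not necessarily the recipe Meta or Al-Dahle has in mind, and the logits are invented toy values.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures flatten the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy next-token logits over a three-word vocabulary (illustrative)
teacher = [4.0, 1.0, 0.5]
student_far = [0.5, 1.0, 4.0]
student_near = [3.5, 1.2, 0.4]

print(distillation_loss(teacher, student_far))
print(distillation_loss(teacher, student_near))
```

Training the student to minimize this loss (usually alongside the ordinary next-token loss) nudges it to reproduce the teacher's full preference ordering over tokens, not just its top choice.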

Another example: Al-Dahle says that Llama 3.1 can integrate with a search engine API to "retrieve information from the Internet based on a complex query and call multiple tools in succession in order to complete your tasks." If you ask the model to plot the number of homes sold in the United States over the last five years, "it can retrieve the [web] search for you and generate the Python code and execute it." Not bad.

The Llama ecosystem also offers turnkey directions for various use cases and advanced workflows for anyone to use. Meta has partnered with projects like vLLM, TensorRT, and PyTorch to build in support right from the start, making it easier for users to get started.

## Moving Forward

Ultimately, Llama 3.1 represents an important leap forward in the pursuit of open, accessible, and responsible AI innovation.

There's some [debate](https://www.platformer.news/meta-llama-3-zuckerberg-open-source-ai/) as to whether open-source models are safer than closed source in the long run. Mark Zuckerberg believes they are. And numerous organizations have signed on to the [AI Alliance](https://newsroom.ibm.com/AI-Alliance-Launches-as-an-International-Community-of-Leading-Technology-Developers,-Researchers,-and-Adopters-Collaborating-Together-to-Advance-Open,-Safe,-Responsible-AI) with Meta and IBM to promote this vision for the future of open AI – including CERN, Hugging Face, Intel, Linux Foundation, NASA, and Oracle.

Here at Vast.ai, we appreciate the accessibility of these open-source LLMs – along with the collaboration of the community around them. Our own mission aligns with this philosophy of democratizing AI for everyone.

To that end, we're pleased to be able to offer the open-source Text Generation Inference (TGI) framework on Vast, so you can serve LLMs like Llama 3.1 and run your own models with much more affordable compute.

Check out our guide on [Serving Online Inference with TGI on Vast.ai](https://vast.ai/article/serving-online-inference-with-tgi-on-vastai)!


Discover Meta's groundbreaking Llama 3.1, the world's largest and most capable open-source AI model, pushing the boundaries of innovation and accessibility.

Meta Launches Llama 3.1: A New Era in Open-Source AI


# Serving Online Inference with vLLM on Vast.ai

## Background

vLLM is an open-source framework for large language model inference. It specifically focuses on throughput for serving and batch workloads, which is important for building apps for multiple users and at scale.

vLLM provides an OpenAI-compatible server, which means that you can integrate it into chatbots and other applications.

As companies build out their AI products, they often hit roadblocks like rate limits and the cost of using these models. With vLLM on Vast, you can run your own models in the form factor you need, but with much more affordable compute. As demand for inference grows with agents and complicated workflows, vLLM on Vast shines for performance and affordability where you need it the most.

This guide will show you how to set up vLLM to serve an LLM on Vast. We reference a notebook that you can use [here](https://nbviewer.org/urls/bitbucket.org/%21api/2.0/snippets/jsbcannell/XEyMo8/72acf45925f8230b75195b7da9fff1884fa4052d/files/serving_vllm_on_vast.json).

## Setup and Querying

First, we set up our environment and Vast API key.

```bash
pip install --upgrade vastai
```

Once you create your account, you can go [here](https://cloud.vast.ai/cli/) to find your API Key.

```bash
vastai set api-key <Your-API-Key-Here>
```

For serving an LLM, we're looking for a machine that has a static IP address, ports available to host on, and a single modern GPU with enough GPU RAM, since we're going to serve a single small model. `vLLM` also requires CUDA version 12.4 or higher, so we will filter for that as well. We will query the Vast API to get a list of these types of machines.

```bash
vastai search offers 'compute_cap > 800 gpu_ram > 20 num_gpus = 1 static_ip=true direct_port_count > 1 cuda_vers >= 12.4'
```

## Deploying the Image:

The easiest way to deploy this instance is to use the command line. Copy the ID of the machine you chose from the list above and paste it in place of `<instance-id>` below.

```bash
vastai create instance <instance-id> --image vllm/vllm-openai:latest --env '-p 8000:8000' --disk 40 --args --model stabilityai/stablelm-2-zephyr-1_6b
```

## Connecting and Testing:

To connect to your instance, we'll first need to get the IP address and port number. Once your instance is done setting up, you should see something like this:
![Instance_view](/uploads/instance_view_vllm.png)

Click on the highlighted button to see the IP address and correct port for our requests.

![IP_address_view](/uploads/ip_address_view_vllm.png)

We will copy the IP address and the port into the command below.

```bash
# This request assumes you haven't changed the model. If you did, update the "model" value in the JSON payload below
curl -X POST http://<IP-Address>:<Port>/v1/completions -H "Content-Type: application/json"  -d '{"model" : "stabilityai/stablelm-2-zephyr-1_6b", "prompt": "Hello, how are you?", "max_tokens": 50}'
```

You will see a response from your model in the output. Your model is up and running on Vast!

In the [notebook](https://nbviewer.org/urls/bitbucket.org/%21api/2.0/snippets/jsbcannell/XEyMo8/72acf45925f8230b75195b7da9fff1884fa4052d/files/serving_vllm_on_vast.json), we include ways to call this model with the `requests` library or the OpenAI client.

## Advanced Usage: Serving a Quantized Llama-3-70b Model:

Now that we've spun up a model on vLLM, we can get into more complicated deployments. We'll work on serving this specific quantized Llama-3 70B [model](https://huggingface.co/casperhansen/llama-3-70b-instruct-awq).

With this quantized model, we can easily serve it on four RTX 4090 GPUs.

### What is different this time around:

1. The model string - we need to use the new model id.
2. We're going to use 4 GPUs instead of just 1.
3. We need to provision much more space on our system to be able to download the full set of weights. 100 GB in this case should be fine.
4. We need to set up tensor parallelism inside vLLM to split up the model across these 4 GPUs.
5. We need to let vLLM know that this is a quantized model.

First, we will search for instances that match our needs.

```bash
vastai search offers 'compute_cap >= 800 gpu_ram >= 24 num_gpus = 4 static_ip=true direct_port_count > 1 cuda_vers >= 12.4'
```

In our instance creation, we will increase our disk usage to 100GB.

Then, we will tell vLLM to (1) use the specific model, (2) split it across 4 GPUs, and (3) treat it as a quantized model.

```bash
vastai create instance <Instance-ID> --image vllm/vllm-openai:latest --env '-p 8000:8000' --disk 100 --args --model casperhansen/llama-3-70b-instruct-awq --tensor-parallel-size 4  --quantization awq
```

### Other things to look out for with other configurations:

If you are downloading a model that needs authentication from the Hugging Face Hub, passing `-e HF_TOKEN=<Your-Read-Only-Token>` within Vast's `--env` variable string should help.

Sometimes the full context of a model can't be used given the space allocated for vLLM on the GPU plus the model's size. In those cases, you might want to increase `--gpu-memory-utilization` or decrease `--max-model-len`. Increasing `--gpu-memory-utilization` does raise the risk of CUDA out-of-memory errors, which can be hard to predict ahead of time.

We won't need either of these for this specific model and GPU configuration.
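For a feel of how these two knobs interact, here is a rough memory budget for this setup. It is an estimate under several assumptions: the layer count, KV-head count, and head size are taken from the public Llama 3 70B config rather than from this guide; AWQ weights are approximated at half a byte per parameter; and activations, CUDA graphs, and fragmentation are ignored. The 0.9 utilization figure is vLLM's default for `--gpu-memory-utilization`.

```python
# Assumed Llama 3 70B shape (from the public model config, not from this guide)
NUM_LAYERS = 80
NUM_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES = 2  # FP16 cache entries

def kv_cache_bytes_per_token():
    """Bytes of KV cache one token occupies across all layers (keys + values)."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * KV_BYTES

def max_cached_tokens(total_gpu_gb, utilization, weights_gb):
    """Tokens that fit in the memory left over after loading the weights."""
    budget_gb = total_gpu_gb * utilization - weights_gb
    return int(budget_gb * 1e9 // kv_cache_bytes_per_token())

total_gpu_gb = 4 * 24            # four 24 GB GPUs
weights_gb = 70e9 * 0.5 / 1e9    # ~4-bit weights, rough approximation

print(kv_cache_bytes_per_token())
print(max_cached_tokens(total_gpu_gb, 0.90, weights_gb))
```

Raising the utilization fraction enlarges the cache budget, while lowering `--max-model-len` reduces how much cache a single request can demand; on tighter configurations, that trade-off is what the two flags control.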

### Testing:

Copy the IP address from your instance once it is ready, and then we can use the following code to call it. Note that while your server might have ports ready, the model might not have downloaded yet as it is much larger this time. You can check the status of this via the logs to see when it has started serving.

```python
import requests

headers = {
    'Content-Type': 'application/json',
}

# Same payload shape as the earlier curl example, pointed at the quantized 70B model
json_data = {
    'model': 'casperhansen/llama-3-70b-instruct-awq',
    'prompt': 'Hello, how are you?',
    'max_tokens': 50,
}

# Replace the placeholders with your instance's IP address and port
response = requests.post('http://<Instance-IP-Address>:<Port>/v1/completions', headers=headers, json=json_data)
print(response.content)
```

Or use the OpenAI Python client:

```bash
pip install openai
```

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://<Instance-IP-Address>:<Port>/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(model="casperhansen/llama-3-70b-instruct-awq",
                                      prompt="Hello, how are you?",
                                      max_tokens=50)
print("Completion result:", completion)
```
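Because the 70B weights can take a while to download, a small polling loop can stand in for watching the logs by hand. The sketch below assumes the server answers on the OpenAI-style `/v1/models` route once the model is loaded; the helper names, the 30-minute cap, and the 15-second interval are illustrative choices, not part of vLLM or the Vast CLI.

```python
import time
import urllib.error
import urllib.request

def server_is_ready(base_url, timeout=5):
    """Return True once GET {base_url}/v1/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all mean "not ready yet"
        return False

def wait_until_ready(check, max_wait=1800, interval=15, sleep=time.sleep):
    """Call check() every `interval` seconds until it returns True or `max_wait` elapses."""
    waited = 0
    while waited <= max_wait:
        if check():
            return True
        sleep(interval)
        waited += interval
    return False

# Usage (replace the placeholders with your instance's values):
# wait_until_ready(lambda: server_is_ready("http://<Instance-IP-Address>:<Port>"))
```

Once `wait_until_ready` returns `True`, the completion requests above should succeed; if it returns `False`, the instance logs are the place to look.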

## Conclusions

Model inference is expensive, and leveraging more affordable compute and models makes a huge difference for engineering teams in terms of margins and shipping velocity.

Using vLLM on Vast is perfect for this, pairing Vast's access to affordable compute with the simplicity and state-of-the-art throughput of the vLLM backend.

vLLM is a great starting point for building generative AI apps. We will continue to explore using this tool more with Vast in future posts.

Llama-3 is already ready to go on Vast to start experimenting and building!


Leverage the power of vLLM for efficient and scalable language model inference on Vast.ai's affordable compute infrastructure.
