Train a 70b language model on a 2X RTX 3090/4090 with QLoRA and FSDP

- Team Vast

March 12, 2024 - Industry

An exciting recent announcement demonstrated a way to train a larger model, such as Llama 2 70B, on 48GB of GPU RAM. Because the method runs across multiple GPUs, it is possible to train such a model on a 2X 4090 instance!

The new open source fsdp_qlora software has a number of training options and examples. It builds on two big ideas:

QLoRA (Quantized LoRA) combines quantization (using fewer bits to store model weights) and LoRA (Low-Rank Adaptation, which adds small trainable matrices to a frozen base model). This allows training models larger than the GPU's memory by using a quantized base model with trainable LoRA adapters. However, QLoRA still has limitations, such as requiring expensive GPUs and restricting sequence lengths and batch sizes due to memory constraints.
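To make the memory saving from quantization concrete, here is a rough back-of-the-envelope sketch of the storage needed for the weights of a 70B-parameter model at two precisions. It counts weights only and ignores quantization metadata, LoRA adapters, activations, gradients and optimizer state, so real usage will be higher. The function name is illustrative and not part of fsdp_qlora.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate storage for the model weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9  # 70 billion parameters

bf16_gb = weight_memory_gb(params_70b, 16)  # unquantized 16-bit weights
int4_gb = weight_memory_gb(params_70b, 4)   # 4-bit quantized weights

print(f"16-bit weights: {bf16_gb:.0f} GB")
print(f" 4-bit weights: {int4_gb:.0f} GB")
```

At 16 bits the weights alone come to roughly 140 GB, while at 4 bits they come to roughly 35 GB, which is what brings a 70B base model within reach of 48GB of total GPU RAM.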

FSDP (Fully Sharded Data Parallel) is a library developed by Meta that efficiently splits a large model across multiple GPUs, allowing them to be used simultaneously. It improves upon the previous gold standard, DDP (Distributed Data Parallel), which required the full model to fit on each GPU. FSDP enables training large models that exceed the memory of a single GPU by sharding the model parameters across multiple GPUs.
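The difference between replicating and sharding the weights can be sketched with simple arithmetic. The example below assumes a 70B model quantized to about 4 bits per weight (roughly 35 GB of weights) and two 24 GB cards such as the RTX 3090 or RTX 4090. It is a simplification: FSDP temporarily gathers full layers during the forward and backward passes, and activations and optimizer state also need room, so treat the numbers as a lower bound. The helper name is hypothetical.

```python
GPU_MEMORY_GB = 24  # RTX 3090 and RTX 4090 each have 24 GB of VRAM


def weights_per_gpu_gb(total_weights_gb: float, num_gpus: int, sharded: bool) -> float:
    """Weights each GPU must hold: a full copy (DDP-style) or a 1/N shard (FSDP-style)."""
    return total_weights_gb / num_gpus if sharded else total_weights_gb


quantized_70b_gb = 35.0  # assumption: ~4 bits per weight for 70B parameters

ddp = weights_per_gpu_gb(quantized_70b_gb, num_gpus=2, sharded=False)
fsdp = weights_per_gpu_gb(quantized_70b_gb, num_gpus=2, sharded=True)

print(f"replicated: {ddp} GB per GPU, fits in {GPU_MEMORY_GB} GB: {ddp < GPU_MEMORY_GB}")
print(f"sharded:    {fsdp} GB per GPU, fits in {GPU_MEMORY_GB} GB: {fsdp < GPU_MEMORY_GB}")
```

With a full copy on every GPU the weights do not fit on a single 24 GB card, while a two-way shard leaves each card holding about 17.5 GB.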

This guide will show you how to rent a 2X instance on Vast to run this software on demand.

How to run fsdp_qlora on a 2X 3090/4090 instance

Vast has a large supply of RTX 3090 and RTX 4090 GPUs well suited to running fsdp_qlora. You will need a Vast account set up with credits. This guide uses the Vast Python CLI.

Create your Vast account, add credits and install the Vast CLI on your local machine.

Set up the account and verify your email address. Add some credits to your account for the GPU rental. Then install the CLI.

Search the marketplace

We will need to find suitable GPUs. In my testing, training Llama 70B required 200GB of system RAM. The queries below return on-demand GPU offers for suitable 2X instances, sorted by download bandwidth.

For 2X 4090 instances

vastai search offers "gpu_name=RTX_4090 cpu_ram>=130 disk_space>140 num_gpus=2" -o "inet_down-"

For 2X 3090 instances

vastai search offers "gpu_name=RTX_3090 cpu_ram>=130 disk_space>140 num_gpus=2" -o "inet_down-"
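If you want to script the search for both GPU types, the two queries above can be generated from one template. This is a minimal sketch that only builds the command line and prints it; it uses the same filter fields and sort option shown above, and the function name and default values are illustrative.

```python
import shlex
from typing import List


def build_search_command(gpu_name: str, min_cpu_ram_gb: int = 130,
                         min_disk_gb: int = 140, num_gpus: int = 2) -> List[str]:
    """Build the argument list for a `vastai search offers` call."""
    query = (f"gpu_name={gpu_name} cpu_ram>={min_cpu_ram_gb} "
             f"disk_space>{min_disk_gb} num_gpus={num_gpus}")
    # "-o inet_down-" sorts the offers by download bandwidth, highest first.
    return ["vastai", "search", "offers", query, "-o", "inet_down-"]


for gpu in ("RTX_4090", "RTX_3090"):
    # shlex.join quotes the query so the printed line can be pasted into a shell.
    print(shlex.join(build_search_command(gpu)))
```

Passing the argument list to `subprocess.run` would execute the search, provided the Vast CLI is installed and configured on your machine.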

Create the instance

The list of offers will have a lot of details about the machines that are available. To pick one, you will need the offer ID. Then use that in the create command along with the PyTorch CUDA Devel template hash.

This command will create an instance on the offer supplied, using that template and allocating 200GB of disk.

Replace the XXXXXXX with the offer ID from the marketplace.

vastai create instance XXXXXXX --template_hash e4c5e88bc289f4eecb0c955c4fe7430d --disk 200

SSH into the instance

Wait a few minutes for the instance to boot up. You can use the web UI to get the direct SSH command by clicking on the >_ button on the instance card.

To connect via SSH, you will simply use the command copied from the instance card. It will look similar to this:

ssh -p <port> <sshURL> -L 8080:localhost:8080

Install fsdp_qlora and log in to Hugging Face

git clone
pip install llama-recipes fastcore --extra-index-url
pip install "bitsandbytes>=0.43.0"
pip install wandb

Log in to Hugging Face by pasting in your API key so that you can download the model

huggingface-cli login

Optional: log in to wandb to enable logging to Weights & Biases

wandb login

Optional: set up HQQ

HQQ is a fast and accurate model quantizer. It can be used instead of bitsandbytes. If you want to use it, you will need to install it first.

git clone

Install HQQ

cd hqq && pip install .

Train the model

For the complete list of options, look at the GitHub repo. The example below finetunes Llama 2 70B with the qlora train type. If you set up HQQ, the hqq_lora train type can be used instead.

Example: finetune Llama 2 70B with context length of 512

cd ~/fsdp_qlora
python \
--model_name meta-llama/Llama-2-70b-hf \
--batch_size 2 \
--context_length 512 \
--precision bf16 \
--train_type qlora \
--use_gradient_checkpointing true \
--use_cpu_offload true \
--dataset alpaca \
--reentrant_checkpointing true
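To get a feel for how much data each optimizer step sees with these settings, the sketch below multiplies out the batch size, GPU count and context length. It assumes that --batch_size is counted per GPU and that no gradient accumulation is used; check the repo's option descriptions to confirm both before relying on the numbers. The function name is illustrative.

```python
def tokens_per_step(batch_size_per_gpu: int, num_gpus: int,
                    context_length: int, grad_accum_steps: int = 1) -> int:
    """Upper bound on tokens processed per optimizer step (full-length sequences)."""
    effective_batch = batch_size_per_gpu * num_gpus * grad_accum_steps
    return effective_batch * context_length


# Values from the example command: --batch_size 2, --context_length 512, on 2 GPUs.
print(tokens_per_step(batch_size_per_gpu=2, num_gpus=2, context_length=512))
```

Under these assumptions each step covers 4 sequences of up to 512 tokens. Raising the context length or batch size increases activation memory, so change one at a time.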


On some instances, the system seems to run only on the CPU, and the GPUs might not be recognized. Monitor the fsdp_qlora repo for help.

For any issues related to the instance or working with Vast, click on the support chat at the bottom right corner of the Vast website for immediate support.

Sometimes it helps to export the CUDA_VISIBLE_DEVICES environment variable when the GPUs do not load and the CPU does all the work.
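A quick way to check whether the GPUs are visible is to set the variable and ask PyTorch what it can see. This is a minimal diagnostic sketch, assuming the two GPUs are devices 0 and 1; it is not a guaranteed fix. The variable must be set before CUDA is initialized, which in practice means before importing torch.

```python
import os

# Assumption: the two rented GPUs are devices 0 and 1.
# setdefault keeps any value that was already exported in the shell.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1")

try:
    import torch

    print("CUDA available:", torch.cuda.is_available())
    print("Visible GPUs:  ", torch.cuda.device_count())
except ImportError:
    print("PyTorch is not installed in this environment")
```

On a healthy 2X instance this should report CUDA as available with two visible GPUs. The shell equivalent is to run export CUDA_VISIBLE_DEVICES=0,1 before launching training.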
