
Harness the Power of Cloud GPUs with Vast.AI: Running the 70B Llama 2 GPTQ

- Team Vast

August 15, 2023 - GPU


The rapid advancements in the AI domain have given rise to the need for powerful computational resources. If you're involved in data science or AI research, you're already aware of the immense processing capability required to run large models like the 70B Llama 2 GPTQ. Fortunately, with cloud GPU rental services like Vast.AI, accessing these resources has become easier and more efficient than ever.

Quick Guide to Launching the Oobabooga webUI on Vast.AI

For those interested in running the groundbreaking 70B Llama 2 GPTQ, TheBloke's quantized release makes this possible.

We’ve created a template that auto-launches the Oobabooga webUI. Notably, this template also grants users direct SSH and Jupyter (notebook) access.

Users should note that this model demands roughly 40 GB of VRAM. To accommodate this, Vast.AI's interface includes a VRAM slider to guide you toward suitable machines like the A6000, A40, and A100.
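As a rough sanity check on that figure, you can estimate weight memory with back-of-envelope arithmetic: at 4 bits per weight, 70 billion parameters occupy about 35 GB before activations and cache. The sketch below is illustrative only; the 15% overhead factor is an assumption, not a measured value.

```python
# Back-of-envelope VRAM estimate for a 4-bit GPTQ model (illustrative only).
def estimate_vram_gb(n_params_billions: float, bits_per_weight: float = 4.0,
                     overhead_frac: float = 0.15) -> float:
    """Weight memory plus a rough (assumed) allowance for activations/KV cache."""
    # billions of params * bytes per weight gives GB directly
    weight_gb = n_params_billions * bits_per_weight / 8
    return weight_gb * (1 + overhead_frac)

print(f"70B @ 4-bit: ~{estimate_vram_gb(70):.0f} GB")  # ~40 GB
print(f"13B @ 4-bit: ~{estimate_vram_gb(13):.0f} GB")  # ~7 GB
```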

For the 70B Model, we recommend the A6000 (currently $0.50/hr) or A40 (currently $0.40/hr). To run the 13B model, we recommend either a 3090 (currently $0.20/hr) or 4090 (currently $0.48/hr). To see all current pricing, refer to our dynamic pricing page, or head to our search console.
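To compare what a working session might cost at those rates, a quick script helps. The prices below are the snapshot quoted above and will drift with Vast.AI's dynamic market pricing; the 8-hour session length is a hypothetical example.

```python
# Rough session-cost comparison; rates are the snapshot quoted above and
# will change with Vast.AI's dynamic market pricing.
rates_per_hr = {"A6000": 0.50, "A40": 0.40, "RTX 3090": 0.20, "RTX 4090": 0.48}
session_hours = 8  # hypothetical day of experimentation

for gpu, rate in sorted(rates_per_hr.items(), key=lambda kv: kv[1]):
    print(f"{gpu:>8}: ${rate:.2f}/hr -> ${rate * session_hours:.2f} for {session_hours} h")
```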

Steps to Download and Execute the Model in text-generation-webui:

  1. Navigate to the Model tab.
  2. In the section labeled Download custom model or LoRA, input TheBloke/Llama-2-70B-chat-GPTQ.
  3. For downloading from a specific branch, for instance TheBloke/Llama-2-70B-chat-GPTQ:gptq-4bit-32g-actorder_True, refer to the Provided Files section for a comprehensive list of branches for each choice (a scripted alternative is sketched after this list).
  4. Click Download. Wait for the model to download, and once it's completed, a "Done" message will appear.
  5. Set the Loader to ExLlama if you plan to use a 4-bit file. Alternatively, opt for AutoGPTQ or GPTQ-for-LLaMa.
  6. If you're using AutoGPTQ, ensure that the "No inject fused attention" option is selected.
  7. On the top left, click the refresh icon adjacent to the Model label.
  8. From the Model dropdown, select the freshly downloaded model: TheBloke/Llama-2-70B-chat-GPTQ.
  9. The model loads automatically; once it does, it's ready to use!
  10. Save your settings by clicking Save settings for this model, then click Reload the Model in the top right to ensure your configuration is retained.
  11. Now, head over to the Text Generation tab, input your desired prompt, and watch the magic unfold!
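If you'd rather pre-fetch the weights from a terminal (for example, over the template's SSH access) instead of through the UI, here is a minimal Python sketch using huggingface_hub. The local_dir path assumes text-generation-webui's default models/ folder naming convention; treat it as an assumption and adjust for your install.

```python
# Minimal sketch: pre-fetch a specific GPTQ branch with huggingface_hub.
# Assumes `pip install huggingface_hub` and text-generation-webui's default
# models/ directory layout; adjust local_dir for your install.
from huggingface_hub import snapshot_download

repo_id = "TheBloke/Llama-2-70B-chat-GPTQ"
branch = "gptq-4bit-32g-actorder_True"  # see the repo's Provided Files section

snapshot_download(
    repo_id=repo_id,
    revision=branch,  # the quantization branch to download
    local_dir=f"models/{repo_id.replace('/', '_')}",
)
```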

For those considering running Llama 2 on GPUs like the 4090s and 3090s, TheBloke/Llama-2-13B-GPTQ is the model you'd want. Vast.AI's platform offers a wide range of machines tailored to your project's requirements.
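If you script against whatever machine you land on, a small helper can choose between the two repositories based on detected VRAM. This is a hypothetical convenience using PyTorch's CUDA introspection, not part of the template; the 40 GB threshold comes from the requirement discussed above.

```python
# Pick a model repo based on detected VRAM; the 40 GB threshold comes from
# the requirement above, and this helper itself is a hypothetical example.
import torch

def pick_model(device: int = 0) -> str:
    vram_gb = torch.cuda.get_device_properties(device).total_memory / 1e9
    if vram_gb >= 40:
        return "TheBloke/Llama-2-70B-chat-GPTQ"  # A6000 / A40 / A100 class
    return "TheBloke/Llama-2-13B-GPTQ"           # 3090 / 4090 class

print(pick_model())
```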

For additional details and to delve deeper, please visit the official GitHub page for Oobabooga's text-generation-webui.

In today's competitive digital environment, it's crucial to have scalable and reliable computational resources at your fingertips. Vast.AI is revolutionizing the way researchers and developers access and utilize GPU power, making AI model training seamless and efficient. Whether you're an AI novice or an established researcher, Vast.AI is your go-to solution for all your cloud GPU rental needs.
