Running Deep Cogito on Vast.ai

- Team Vast

April 22, 2025 · Deep Cogito · Vast.ai · LLM · Hybrid AI

Deep Cogito models are hybrid reasoning Large Language Models (LLMs) uniquely designed to operate in two distinct modes within a single deployment: standard direct answering and detailed step-by-step reasoning. Unlike traditional models that require separate deployments for different reasoning capabilities, Deep Cogito can seamlessly toggle between these modes using simple prompt engineering.

This guide demonstrates how to deploy deepcogito/cogito-v1-preview-llama-8B on Vast.ai and leverage its dual reasoning capabilities through vLLM's OpenAI-compatible API.

Deploying on Vast.ai

First, install the Vast.ai SDK and configure your API key:

pip install --upgrade vastai
export VAST_API_KEY="YOUR_KEY_HERE"
vastai set api-key $VAST_API_KEY

Search for a suitable instance that meets the model's requirements:

vastai search offers "compute_cap >= 750 \
geolocation=US \
gpu_ram >= 60 \
num_gpus = 1 \
static_ip = true \
direct_port_count >= 1 \
verified = true \
disk_space >= 80 \
rentable = true"

Deploy the model using vLLM's OpenAI-compatible API:

vastai create instance YOUR_INSTANCE_ID \
  --image vllm/vllm-openai:latest \
  --env '-p 8000:8000' \
  --disk 80 \
  --args --model deepcogito/cogito-v1-preview-llama-8B

Using Deep Cogito with OpenAI SDK

Install the OpenAI SDK:

pip install --upgrade openai

Enabling Thinking Mode

Deep Cogito provides two methods to toggle thinking mode:

  1. Hugging Face Transformers Library Method: When using the Hugging Face Transformers library directly, you can enable thinking mode by passing the enable_thinking=True flag to the tokenizer's apply_chat_template method. This activates the model's reasoning process, displaying its thought process within <think> tags before providing the final answer.

  2. vLLM OpenAI-compatible API Method: When using vLLM's OpenAI-compatible API (as in our Vast.ai deployment), use a system prompt instead. The official prompt is:

    "Enable deep thinking subroutine."
    

    For the 8B model specifically, we recommend using this enhanced version for more consistent results:

    "Enable deep thinking subroutine. \n\n YOU MUST USE THE THINKING SUBROUTINE"
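Since the two modes differ only in whether this system prompt is present, the toggle can be captured in a small helper that builds the message list. This is an illustrative sketch of our own, not part of any official API:

```python
# Enhanced system prompt recommended above for the 8B model.
THINKING_PROMPT = "Enable deep thinking subroutine. \n\n YOU MUST USE THE THINKING SUBROUTINE"

def build_messages(prompt: str, thinking: bool = False) -> list:
    """Build an OpenAI-style message list, optionally enabling thinking mode."""
    messages = []
    if thinking:
        # Thinking mode is toggled purely via the system prompt.
        messages.append({"role": "system", "content": THINKING_PROMPT})
    messages.append({"role": "user", "content": prompt})
    return messages
```

The same prompt can then be sent in either mode by flipping the `thinking` flag, rather than maintaining two separate call sites.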
    

Here's an implementation example using the OpenAI SDK:

from openai import OpenAI

# Vast.ai instance configuration
VAST_IP_ADDRESS = "YOUR_VAST_IP"
VAST_PORT = "YOUR_VAST_PORT"

# Initialize the OpenAI client
client = OpenAI(
    api_key="EMPTY",  # vLLM doesn't require an API key
    base_url=f"http://{VAST_IP_ADDRESS}:{VAST_PORT}/v1"
)

# Function for reasoning mode
def chat_with_model_reasoning(prompt):
    response = client.chat.completions.create(
        model="deepcogito/cogito-v1-preview-llama-8B",
        messages=[
            {"role": "system", "content": "Enable deep thinking subroutine. \n\n YOU MUST USE THE THINKING SUBROUTINE"},
            {"role": "user", "content": prompt}
        ],
    )
    return response.choices[0].message.content

# Function for standard mode
def chat_with_model(prompt):
    response = client.chat.completions.create(
        model="deepcogito/cogito-v1-preview-llama-8B",
        messages=[
            {"role": "user", "content": prompt}
        ],
    )
    return response.choices[0].message.content
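In thinking mode the reasoning arrives inline, wrapped in <think>...</think> tags ahead of the final answer, so it is often useful to separate the trace from the answer programmatically. A minimal parser (a helper of our own, not part of vLLM or the model's API) might look like:

```python
import re

def split_thinking(text: str):
    """Split a thinking-mode response into (reasoning, final_answer).

    If no <think> block is present (standard mode), reasoning is empty.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer
```

For example, `reasoning, answer = split_thinking(chat_with_model_reasoning("Explain LLMs"))` lets you log the reasoning trace while showing only the final answer to end users.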

Response Comparison

The following examples demonstrate how the model's responses differ between thinking mode and standard mode when asked to explain Large Language Models:

With Thinking Mode:

<think>
Okay, I need to give a short introduction to Large Language Models (LLMs). Let me start by recalling what I know about them. LLMs are a type of artificial intelligence that uses neural networks to process and generate human-like text. They're trained on massive amounts of text data from the internet, books, and other sources. This training allows them to understand and generate contextually relevant responses.

Wait, I should probably mention that the training process involves machine learning, specifically deep learning. They use transformer architectures, which were introduced by Google in 2017. Transformers revolutionized NLP by allowing the model to process input in parallel, which makes them more efficient than previous sequential models.

But maybe I'm getting a bit too technical. The user might just want a basic overview. Let me focus on the key aspects. LLMs are known for their ability to understand natural language and generate coherent text. They can engage in conversations, answer questions, write stories, translate languages, and even help with coding. But they also have limitations, like sometimes providing incorrect or biased information.

I should also mention some popular examples, like GPT-3, BERT, and the more recent ones like GPT-4. These models vary in size and capability. For instance, GPT-3 has 175 billion parameters, making it one of the largest models. GPT-4 is even bigger and claimed to have more capabilities, though it's not publicly available yet.

Also, LLMs are used in various applications, such as chatbots, virtual assistants, and content generation tools. They can help with tasks like summarization, question answering, and sentiment analysis. But it's important to note that they're not perfect and can have issues like hallucinations (making up information) or biases inherited from the training data.

Hmm, I might need to simplify this a bit. Maybe start with a basic definition and then highlight their key features and applications. Avoid too much technical jargon unless necessary.

Wait, I should also mention the training data aspect. LLMs learn from enormous datasets, which means they can pick up on language patterns, idioms, and even cultural references. But this also means they can reflect biases present in the data, which is a concern.

In summary, LLMs are advanced AI models that can understand and generate human-like text using deep learning techniques. They excel at language tasks but have limitations. They're widely used in various applications but also require careful consideration of their ethical implications.
</think>

Large Language Models (LLMs) are advanced AI systems that can process and understand language in ways similar to human intelligence. These models use neural networks and are trained on vast amounts of text data from the internet, books, and other sources.

Here are the key things you should know about LLMs:

1. **How they work**: LLMs use a technology called transformer architecture, which allows them to process language in parallel and maintain context in conversations. This means they can understand what you're saying and respond appropriately.

2. **Capabilities**: LLMs are incredibly versatile, able to:
   - Engage in conversations
   - Answer questions
   - Write stories and articles
   - Help with coding
   - Translate languages
   - Summarize text
   - Create content

3. **Examples**: Popular LLMs include GPT-3, BERT, and ChatGPT. These models vary in size and capability, but they all share the ability to understand and generate human-like text.

4. **Important limitations**: While powerful, LLMs aren't perfect. They can:
   - Make up information (hallucinations)
   - Reflect biases from their training data
   - Sometimes provide incorrect answers

5. **Applications**: LLMs are used in:
   - Chatbots
   - Virtual assistants
   - Content generation tools
   - Language translation services
   - Text analysis and processing

As these models continue to evolve, they're becoming increasingly important in our digital world, though we need to remain mindful of their limitations and potential biases.

Without Thinking Mode:

Large Language Models (LLMs) are AI systems that use neural networks to process and generate human-like text. They're pre-trained on massive datasets of text and can understand context, generate coherent responses, and perform various language tasks like translation, summarization, and question-answering.

The most well-known LLMs include GPT (Generative Pre-trained Transformer) models created by OpenAI, BERT (Bidirectional Encoder Representations from Transformers) created by Google, and others. They've become increasingly useful for applications like chatbots, content generation, and language understanding tasks.

Key characteristics of LLMs include their ability to understand context, generate human-like text, and adapt to different tasks, though they can also sometimes produce incorrect or biased information.

Conclusion

Deep Cogito on Vast.ai offers a cost-effective solution for accessing a powerful reasoning LLM. The system prompt technique enables visibility into the model's reasoning process, making it particularly valuable for complex tasks such as coding, mathematical problems, and detailed analysis.

While the enable_thinking flag isn't available through vLLM's API, the system prompt approach can be highly effective with proper configuration. Users should experiment with prompt variations to achieve optimal results, as factors such as emphasis and repetition can significantly impact the consistency of the thinking mode, especially with the 8B model.

For production environments where understanding the model's reasoning process is crucial, this approach provides a robust and economical solution. We recommend testing different prompt variations to find the optimal configuration for your specific use case, as the effectiveness of thinking mode can vary depending on the complexity of the task and the chosen model size.
