Getting Started with Ollama - Running LLMs Locally

Elliot Forbes · Mar 7, 2026 · 7 min read

If you’ve been wanting to experiment with large language models but don’t want to pay for API calls or send your data to third-party servers, Ollama might be exactly what you need. In this tutorial, we’ll walk through everything you need to know to get Ollama up and running on your machine.

What is Ollama and Why Should You Care?

Ollama is a lightweight tool that makes it dead simple to run large language models (LLMs) locally on your own hardware. Think of it as a bridge between you and powerful AI models like Llama 3, Mistral, and others—but without needing to understand all the complex setup involved.

Here’s why running LLMs locally is pretty great:

Privacy and Security: Your prompts and outputs never leave your machine. If you’re working with sensitive data or just value your privacy, this is a huge win. You’re not sending anything to OpenAI, Anthropic, or any other company.

Speed: Once a model is running locally, there’s no network latency. No waiting for API responses. You get instant feedback, which is fantastic for iterative work and development.

Cost Savings: No API fees to worry about. Once you’ve downloaded a model (which happens once), you can run it as many times as you want without paying a cent.

Offline Access: Internet down? No problem. Your local models keep running. This is invaluable for development work in areas with spotty connectivity.

Control and Customization: You have complete control over your models and can fine-tune them to your specific needs if you want to get advanced.

Installing Ollama

Getting Ollama installed is straightforward. Head over to ollama.ai and download the version for your operating system. Let’s walk through each platform:

macOS

If you’re on a Mac, just download the .dmg file and drag Ollama into your Applications folder. It’s that simple. Once installed, you’ll see the Ollama icon in your menu bar.

Linux

For Linux users, you can use the install script:

curl -fsSL https://ollama.ai/install.sh | sh

This will download and install Ollama along with all its dependencies. If you prefer not to use the script, you can also find platform-specific packages on the GitHub releases page.

Windows

Windows users can download the installer executable from the website. Run it, follow the prompts, and you’re done. Ollama integrates with your system and runs in the background, and you can then use it from the command line.

Pulling and Running Your First Model

Once Ollama is installed, open your terminal and let’s get your first model running. The process is incredibly straightforward.

Let’s start with Llama 2, a popular open-source model:

ollama run llama2

That’s it. Ollama will:

  1. Check if you have the model locally
  2. If not, download it (this might take a few minutes depending on your internet speed)
  3. Load it into memory
  4. Drop you into an interactive chat session

You can now type prompts and chat with the model. Try asking it something like:

>>> Tell me a joke about programming

When you’re done, just type /bye or press Ctrl+D to quit.

If you prefer Mistral, which performs well and runs fast on modest hardware, you can do:

ollama run mistral

Or if you want to try a smaller model that’s perfect for testing, try Orca Mini:

ollama run orca-mini

Ollama CLI Basics

Once you’re familiar with running models, here are the essential CLI commands you’ll use regularly:

Running a Model

ollama run <model-name>

Listing Available Models

Want to see what models you have downloaded locally?

ollama list

This shows you each model’s name (with its tag), ID, size, and when it was last modified.

Pulling a Model

You don’t need to run a model to download it. You can pull it ahead of time:

ollama pull llama2

This is useful if you want to download models when you have time, rather than waiting during a work session.

Removing a Model

If you need to free up disk space, remove a model:

ollama rm llama2

This deletes the model from your local storage.
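If you script around the CLI, it’s handy to know whether a model is already on disk before kicking off a pull. Here’s a minimal sketch of that check — note the sample table is illustrative output shaped like `ollama list` (NAME, ID, SIZE, MODIFIED columns), not captured from a real machine:

```python
def needs_pull(list_output: str, model: str) -> bool:
    """Return True if `model` does not appear in the NAME column
    of `ollama list` output."""
    for line in list_output.splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        name = line.split()[0]
        # Names carry a tag, e.g. "llama2:latest"; match the bare name too.
        if name == model or name.split(":")[0] == model:
            return False
    return True

# Illustrative `ollama list` output:
sample = """NAME            ID            SIZE    MODIFIED
llama2:latest   78e26419b446  3.8 GB  2 days ago
mistral:latest  61e88e884507  4.1 GB  5 hours ago
"""

print(needs_pull(sample, "mistral"))    # False: already downloaded
print(needs_pull(sample, "orca-mini"))  # True: would need a pull
```

You could wire this up to `subprocess.run(["ollama", "list"], ...)` and conditionally shell out to `ollama pull` only when needed.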

Using the Ollama REST API

Here’s where things get really powerful. Ollama exposes a REST API on localhost:11434, which means you can integrate it into your applications, scripts, and workflows.

Making a Simple Request

Let’s start with a basic curl request:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

The API returns a JSON response with the model’s output. Setting "stream": false gives you the complete response at once.

Streaming Responses

If you want responses streamed back to you (useful for long outputs), set "stream": true:

curl http://localhost:11434/api/generate -d '{
  "model": "mistral",
  "prompt": "Write a short story about a robot",
  "stream": true
}'

This will output newline-delimited JSON chunks as they’re generated, giving you real-time updates.
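Each streamed line is a standalone JSON object carrying a `response` fragment and a `done` flag. As a sketch, here’s one way to reassemble the full text from those chunks — the sample lines below are illustrative, shaped like the generate endpoint’s stream. With the `requests` library you would feed this helper `response.iter_lines(decode_unicode=True)` from a `stream=True` request:

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the "response" fields of newline-delimited JSON
    chunks, stopping at the final chunk (which has "done": true)."""
    parts = []
    for raw in ndjson_lines:
        if not raw.strip():
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Illustrative chunks shaped like the streaming generate response:
sample = [
    '{"model":"mistral","response":"Once","done":false}',
    '{"model":"mistral","response":" upon a time","done":false}',
    '{"model":"mistral","response":"","done":true}',
]
print(join_stream(sample))  # Once upon a time
```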

Using the Chat API

For a more natural conversational interface, use the chat endpoint:

curl http://localhost:11434/api/chat -d '{
  "model": "llama2",
  "messages": [
    { "role": "user", "content": "Hello! What can you help me with?" }
  ]
}'

The endpoint itself is stateless: you maintain conversation context by sending the full messages array on each turn, appending the model’s reply before your next request. This gives you a natural multi-turn interface.
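To make that bookkeeping concrete, here’s a small sketch of a multi-turn session. `ChatSession` is a name I made up, and the `send` hook is injectable purely so the history logic can be exercised without a running server; the default posts to the real chat endpoint:

```python
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"

class ChatSession:
    """Keeps the running messages list and appends each assistant
    reply so context carries forward to the next request."""

    def __init__(self, model="llama2", send=None):
        self.model = model
        self.messages = []
        # `send` is injectable so the bookkeeping can be tested
        # without a live Ollama server.
        self.send = send or self._post

    def _post(self, payload):
        import requests  # imported lazily; only needed for real calls
        response = requests.post(OLLAMA_CHAT_URL, json=payload, timeout=120)
        response.raise_for_status()
        return response.json()["message"]

    def ask(self, content):
        self.messages.append({"role": "user", "content": content})
        reply = self.send({
            "model": self.model,
            "messages": self.messages,
            "stream": False,
        })
        self.messages.append(reply)  # keep context for the next turn
        return reply["content"]
```

With a server running, usage looks like `chat = ChatSession()` followed by `chat.ask("Hello!")` and then `chat.ask("What did I just say?")` — the second question works because the first exchange is still in the messages list.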

Python Example

Here’s a quick Python snippet to interact with Ollama:

import requests

def ask_ollama(prompt, model="mistral"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        },
        timeout=120,  # local generation can take a while on modest hardware
    )
    response.raise_for_status()
    return response.json()["response"]

if __name__ == "__main__":
    result = ask_ollama("Explain quantum computing in simple terms")
    print(result)

Choosing the Right Model for Your Use Case

Not all models are created equal. Here’s how to think about the tradeoffs:

Model Size and Speed: Smaller models run faster and use less RAM. Mistral’s default download is around 4GB, while Llama 2 comes in 7B and 13B parameter variants at roughly 4GB and 7GB for the quantized downloads. Orca Mini is even smaller at around 2GB.

Quality vs. Resources: Larger models tend to produce better, more nuanced responses. If you have the hardware, a 13B parameter model will generally outperform a 7B model. But if you’re on a laptop, the smaller model might be your sweet spot.

Specialized Models: Some models are fine-tuned for specific tasks. Phind CodeLlama (available as phind-codellama), for example, is optimized for code-related queries. Choosing a specialized model can give you better results for your particular use case.

Latency Requirements: If you’re building an application where user experience matters, faster response times from smaller models might be worth the slight quality trade-off.

My recommendation for most people starting out: begin with Mistral. It’s fast, produces good quality responses, and won’t bog down your machine.
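To make the trade-off concrete, here’s a toy picker that applies the rough guidance above. The model tags are real Ollama names, but the RAM thresholds are my own rule-of-thumb assumptions, not published requirements:

```python
def pick_model(ram_gb: float) -> str:
    """Rough starter-model suggestion from available RAM.
    Thresholds are ballpark assumptions, not official figures."""
    if ram_gb >= 16:
        return "llama2:13b"  # room for a 13B model plus the OS
    if ram_gb >= 8:
        return "mistral"     # fast, good quality on 8GB machines
    return "orca-mini"       # smallest fallback, fine for testing

print(pick_model(32))  # llama2:13b
print(pick_model(8))   # mistral
print(pick_model(4))   # orca-mini
```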

System Requirements and Hardware Considerations

Before you dive in, let’s talk about what you actually need:

RAM: This is the big one. As a rule of thumb, allocate RAM roughly equal to the model size. A 7B parameter model needs around 7-8GB of RAM. A 13B model needs 13-16GB. You’ll also need extra RAM for your operating system and other applications, so aim for at least 16GB total if you want comfortable breathing room.

CPU: A modern multi-core CPU helps, but it’s not the limiting factor. You can run these models on older hardware, but they’ll be slower.

GPU: This is where things get interesting. If you have a GPU, Ollama can use it to accelerate inference significantly. For NVIDIA cards, this is handled automatically if you have CUDA installed. Mac users with Apple Silicon get GPU acceleration out of the box.

Disk Space: Models range from 2GB to 40GB+, so have at least 50GB of free space if you plan to experiment with multiple models.
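You can sanity-check the RAM rule of thumb with simple arithmetic: the weights take parameters × bits-per-weight ÷ 8 bytes, plus headroom for the KV cache and runtime. A sketch — the 1.2 overhead factor and the 8-bit default are my assumptions for illustration, not Ollama-published numbers:

```python
def estimated_ram_gb(params_billion: float, bits_per_weight: int = 8,
                     overhead: float = 1.2) -> float:
    """Back-of-the-envelope RAM estimate for running a model.
    1B parameters at 8 bits per weight is roughly 1GB of weights;
    `overhead` covers the KV cache and runtime (an assumption)."""
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

print(estimated_ram_gb(7))       # ~8.4GB, in line with the 7-8GB rule of thumb
print(estimated_ram_gb(13))      # ~15.6GB, in line with 13-16GB
print(estimated_ram_gb(7, 4))    # ~4.2GB for a 4-bit quantized 7B model
```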

Custom Modelfiles for System Prompts

Want to customize how a model behaves? Ollama supports Modelfiles, which let you define custom system prompts and parameters.

Here’s an example Modelfile that creates a helpful coding assistant:

FROM llama2

SYSTEM """You are an expert software engineer. You provide clear, concise code examples and explanations. Always ask clarifying questions before diving into complex solutions."""

PARAMETER temperature 0.7
PARAMETER top_k 40
PARAMETER top_p 0.9

Save this as Modelfile (no extension), then create your custom model:

ollama create coding-helper -f Modelfile

Now run it:

ollama run coding-helper

Your model will now respond with the personality and style you’ve defined. You can tweak parameters to adjust creativity (temperature), randomness, and other behaviors.
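You don’t have to bake parameters into a Modelfile, either: the REST API accepts the same knobs per request through an "options" object. As a sketch, here’s a small payload builder whose defaults mirror the Modelfile above:

```python
def generate_payload(prompt, model="llama2", temperature=0.7,
                     top_k=40, top_p=0.9):
    """Build a /api/generate request body with per-request sampling
    options, matching the PARAMETER values from the Modelfile."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": temperature,
            "top_k": top_k,
            "top_p": top_p,
        },
    }

payload = generate_payload("Refactor this loop into a list comprehension")
print(payload["options"])  # {'temperature': 0.7, 'top_k': 40, 'top_p': 0.9}
```

You would send this with `requests.post("http://localhost:11434/api/generate", json=payload)`, which makes it easy to experiment with temperature without creating a new model each time.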

Wrapping Up

You’re now ready to run large language models on your own machine. Start by installing Ollama, pulling a model, and experimenting. The barrier to entry is incredibly low, and the possibilities are genuinely exciting.

Whether you’re building applications, exploring AI capabilities, or just curious about how these models work, Ollama gives you a playground to experiment without worrying about costs or privacy.

The beauty of running LLMs locally is that you can iterate quickly, customize models to your needs, and keep your data private. Give it a shot, and I think you’ll find it’s a game-changer for how you work with AI.

Happy experimenting!