Calling Ollama from a Go Application
In this tutorial, we’re going to explore how to interact with Ollama directly from a Go application. Ollama runs a local REST API that lets you generate text, handle chat conversations, create embeddings, and more. The best part? It’s super straightforward to work with from Go.
By the end of this guide, you’ll know how to make requests to Ollama, handle streaming responses, build a reusable client, and even create a simple chatbot API. Let’s dive in.
Prerequisites
Before we get started, you’ll need:
- Ollama installed on your machine (grab it from ollama.ai)
- At least one model pulled (try `ollama pull mistral` or `ollama pull llama2`)
- A working Go development environment (version 1.16+)
- Basic familiarity with Go, HTTP requests, and JSON
Make sure Ollama is running before you try any of the examples. By default, it listens on http://localhost:11434.
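A quick way to confirm the server is up before trying the examples is to hit the `/api/tags` endpoint, which returns the models you've pulled. Here's a minimal sketch; the `parseTags` helper and its struct fields are our own names, shaped around the documented tags response:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// tagsResponse mirrors the JSON shape returned by GET /api/tags.
type tagsResponse struct {
	Models []struct {
		Name string `json:"name"`
	} `json:"models"`
}

// parseTags extracts model names from a raw /api/tags response body.
func parseTags(data []byte) ([]string, error) {
	var tr tagsResponse
	if err := json.Unmarshal(data, &tr); err != nil {
		return nil, err
	}
	names := make([]string, 0, len(tr.Models))
	for _, m := range tr.Models {
		names = append(names, m.Name)
	}
	return names, nil
}

func main() {
	resp, err := http.Get("http://localhost:11434/api/tags")
	if err != nil {
		fmt.Println("Ollama doesn't appear to be running:", err)
		return
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	names, err := parseTags(body)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Available models:", names)
}
```

If this prints an error, start the server (or the desktop app) before continuing.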
Understanding Ollama’s API Endpoints
Ollama exposes several key endpoints that we’ll work with:
- `/api/generate` - Generate text based on a prompt (supports streaming)
- `/api/chat` - Handle multi-turn conversations with message history
- `/api/embeddings` - Create vector embeddings for text
- `/api/tags` - List all pulled models
For most of our examples, we’ll focus on /api/generate and /api/chat since those are the most commonly used.
Making Your First Request
Let’s start simple. Here’s how to make a basic generation request to Ollama using Go’s standard net/http package:
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Prepare the request payload
	payload := map[string]interface{}{
		"model":  "mistral",
		"prompt": "What is the capital of France?",
		"stream": false,
	}
	payloadBytes, _ := json.Marshal(payload)

	// Make the POST request
	resp, err := http.Post(
		"http://localhost:11434/api/generate",
		"application/json",
		bytes.NewBuffer(payloadBytes),
	)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	defer resp.Body.Close()

	// Read and parse the response
	body, _ := io.ReadAll(resp.Body)
	var result map[string]interface{}
	json.Unmarshal(body, &result)

	fmt.Println("Response:", result["response"])
}
```
This works, but there’s something important to know about Ollama: by default, it streams responses. Let’s handle that.
Handling Streaming Responses
When you don’t set "stream": false, Ollama sends back NDJSON (newline-delimited JSON). Each line is a separate JSON object. This is actually really efficient for getting incremental results as they’re generated.
Here’s how to handle streaming:
```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	payload := map[string]interface{}{
		"model":  "mistral",
		"prompt": "Write a short poem about Go programming",
		"stream": true, // Enable streaming
	}
	payloadBytes, _ := json.Marshal(payload)

	resp, err := http.Post(
		"http://localhost:11434/api/generate",
		"application/json",
		bytes.NewBuffer(payloadBytes),
	)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	defer resp.Body.Close()

	// Read the streaming response line by line
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		var result map[string]interface{}
		json.Unmarshal(scanner.Bytes(), &result)

		// Print each chunk of text as it arrives
		if text, ok := result["response"].(string); ok {
			fmt.Print(text)
		}
	}
	fmt.Println() // Final newline
}
```
This approach lets you display results to the user as they come in, which feels much more responsive.
Building a Chat Application
For multi-turn conversations, Ollama provides the /api/chat endpoint. You send the full list of messages with roles (user, assistant, system) on every request; the endpoint itself is stateless, so your application keeps the history and the model uses whatever you send as context:
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Model    string    `json:"model"`
	Messages []Message `json:"messages"`
	Stream   bool      `json:"stream"`
}

func chat(messages []Message) string {
	req := ChatRequest{
		Model:    "mistral",
		Messages: messages,
		Stream:   false,
	}
	payload, _ := json.Marshal(req)

	resp, err := http.Post(
		"http://localhost:11434/api/chat",
		"application/json",
		bytes.NewBuffer(payload),
	)
	if err != nil {
		fmt.Println("Error:", err)
		return ""
	}
	defer resp.Body.Close()

	var result map[string]interface{}
	json.NewDecoder(resp.Body).Decode(&result)
	message := result["message"].(map[string]interface{})
	return message["content"].(string)
}

func main() {
	messages := []Message{
		{Role: "user", Content: "What's your favorite programming language?"},
	}

	response := chat(messages)
	fmt.Println("Assistant:", response)

	// Continue the conversation
	messages = append(messages, Message{Role: "assistant", Content: response})
	messages = append(messages, Message{Role: "user", Content: "Why do you prefer it?"})

	response = chat(messages)
	fmt.Println("Assistant:", response)
}
```
Notice how we keep appending to the messages slice? That’s what gives the model context about the previous conversation.
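One consequence of appending forever: the history eventually outgrows the model's context window. A common fix, sketched below with names of our own invention, is to keep an optional leading system message plus only the most recent N messages:

```go
package main

import "fmt"

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// trimHistory keeps a leading system message (if present) plus the
// last keep messages, dropping the middle of the conversation.
func trimHistory(messages []Message, keep int) []Message {
	var head []Message
	rest := messages
	if len(rest) > 0 && rest[0].Role == "system" {
		head = rest[:1]
		rest = rest[1:]
	}
	if len(rest) > keep {
		rest = rest[len(rest)-keep:]
	}
	return append(append([]Message{}, head...), rest...)
}

func main() {
	msgs := []Message{
		{Role: "system", Content: "You are terse."},
		{Role: "user", Content: "one"},
		{Role: "assistant", Content: "1"},
		{Role: "user", Content: "two"},
		{Role: "assistant", Content: "2"},
	}
	for _, m := range trimHistory(msgs, 2) {
		fmt.Println(m.Role+":", m.Content)
	}
}
```

You'd call `trimHistory(messages, N)` just before each `chat` call. Smarter strategies exist (summarizing older turns, counting tokens instead of messages), but this keeps requests bounded with very little code.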
Creating a Reusable Ollama Client
Writing the same HTTP logic repeatedly gets tedious. Let’s build a proper client struct:
```go
package main

import (
	"bufio"
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type OllamaClient struct {
	baseURL string
	client  *http.Client
}

type GenerateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	Stream bool   `json:"stream"`
}

type GenerateResponse struct {
	Response string `json:"response"`
	Done     bool   `json:"done"`
}

func NewOllamaClient(baseURL string) *OllamaClient {
	return &OllamaClient{
		baseURL: baseURL,
		client: &http.Client{
			Timeout: 5 * time.Minute,
		},
	}
}

func (c *OllamaClient) Generate(ctx context.Context, model, prompt string) (string, error) {
	req := GenerateRequest{
		Model:  model,
		Prompt: prompt,
		Stream: false,
	}
	payload, _ := json.Marshal(req)

	httpReq, err := http.NewRequestWithContext(
		ctx,
		"POST",
		c.baseURL+"/api/generate",
		bytes.NewBuffer(payload),
	)
	if err != nil {
		return "", err
	}
	httpReq.Header.Set("Content-Type", "application/json")

	resp, err := c.client.Do(httpReq)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var result GenerateResponse
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	return result.Response, nil
}

func (c *OllamaClient) GenerateStream(ctx context.Context, model, prompt string, callback func(string)) error {
	req := GenerateRequest{
		Model:  model,
		Prompt: prompt,
		Stream: true,
	}
	payload, _ := json.Marshal(req)

	httpReq, err := http.NewRequestWithContext(
		ctx,
		"POST",
		c.baseURL+"/api/generate",
		bytes.NewBuffer(payload),
	)
	if err != nil {
		return err
	}
	httpReq.Header.Set("Content-Type", "application/json")

	resp, err := c.client.Do(httpReq)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		var result GenerateResponse
		if err := json.Unmarshal(scanner.Bytes(), &result); err != nil {
			return err
		}
		callback(result.Response)
	}
	return scanner.Err()
}

func main() {
	client := NewOllamaClient("http://localhost:11434")

	// Simple generation
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancel()

	response, err := client.Generate(ctx, "mistral", "Explain quantum computing briefly")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println(response)

	// Streaming with callback
	err = client.GenerateStream(ctx, "mistral", "Count to 5", func(chunk string) {
		fmt.Print(chunk)
	})
	if err != nil {
		fmt.Println("Error:", err)
	}
	fmt.Println()
}
```
This client wraps the HTTP complexity and gives us a clean API to work with.
Using the Official Ollama Go Library
If you prefer not to roll your own, the Ollama project ships an official Go client as the `api` package of its main module. Install it with:

```shell
go get github.com/ollama/ollama
```

Then use it like this:
```go
package main

import (
	"context"
	"fmt"

	"github.com/ollama/ollama/api"
)

func main() {
	// Uses OLLAMA_HOST if set, otherwise defaults to http://localhost:11434
	client, err := api.ClientFromEnvironment()
	if err != nil {
		fmt.Println("Error:", err)
		return
	}

	stream := false
	req := &api.GenerateRequest{
		Model:  "mistral",
		Prompt: "Hello, how are you?",
		Stream: &stream, // the library streams by default
	}

	// Responses are delivered through a callback, even with streaming off
	err = client.Generate(context.Background(), req, func(resp api.GenerateResponse) error {
		fmt.Println(resp.Response)
		return nil
	})
	if err != nil {
		fmt.Println("Error:", err)
	}
}
```

Note that `Stream` is a `*bool` in this library and results always arrive through the callback; with streaming disabled, it fires once with the full response.
Much simpler if you’re comfortable with external dependencies.
Handling Errors and Timeouts
Long-running generations can take time. Use context.WithTimeout to avoid hanging forever:
```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func generateWithTimeout(model, prompt string, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	payload := map[string]interface{}{
		"model":  model,
		"prompt": prompt,
		"stream": false,
	}
	payloadBytes, _ := json.Marshal(payload)

	req, _ := http.NewRequestWithContext(
		ctx,
		"POST",
		"http://localhost:11434/api/generate",
		bytes.NewBuffer(payloadBytes),
	)
	req.Header.Set("Content-Type", "application/json")

	client := &http.Client{}
	resp, err := client.Do(req)
	if err != nil {
		if ctx.Err() == context.DeadlineExceeded {
			return "", fmt.Errorf("generation timed out after %v", timeout)
		}
		return "", err
	}
	defer resp.Body.Close()

	var result map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&result); err != nil {
		return "", err
	}
	text, _ := result["response"].(string)
	return text, nil
}

func main() {
	response, err := generateWithTimeout("mistral", "Tell me a joke", 30*time.Second)
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println(response)
}
```
This approach prevents your application from hanging if Ollama becomes unresponsive.
Building a Chatbot API Server
Let’s tie everything together and create a simple HTTP server that wraps Ollama as a chatbot API:
```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type ChatRequest struct {
	Message string `json:"message"`
}

type ChatResponse struct {
	Reply string `json:"reply"`
}

var (
	conversationHistory []Message
	mu                  sync.Mutex
)

func chatHandler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodPost {
		http.Error(w, "Method not allowed", http.StatusMethodNotAllowed)
		return
	}

	var req ChatRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, "Invalid JSON", http.StatusBadRequest)
		return
	}

	mu.Lock()
	conversationHistory = append(conversationHistory, Message{
		Role:    "user",
		Content: req.Message,
	})
	// Copy the history under the lock so concurrent requests don't race
	history := append([]Message{}, conversationHistory...)
	mu.Unlock()

	// Build the chat request with full history
	chatReq := map[string]interface{}{
		"model":    "mistral",
		"messages": history,
		"stream":   false,
	}
	payloadBytes, _ := json.Marshal(chatReq)

	resp, err := http.Post(
		"http://localhost:11434/api/chat",
		"application/json",
		bytes.NewBuffer(payloadBytes),
	)
	if err != nil {
		http.Error(w, "Ollama error", http.StatusInternalServerError)
		return
	}
	defer resp.Body.Close()

	var ollamaResp map[string]interface{}
	json.NewDecoder(resp.Body).Decode(&ollamaResp)
	message, ok := ollamaResp["message"].(map[string]interface{})
	if !ok {
		http.Error(w, "Unexpected response from Ollama", http.StatusInternalServerError)
		return
	}
	assistantResponse, _ := message["content"].(string)

	mu.Lock()
	conversationHistory = append(conversationHistory, Message{
		Role:    "assistant",
		Content: assistantResponse,
	})
	mu.Unlock()

	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(ChatResponse{Reply: assistantResponse})
}

func main() {
	http.HandleFunc("/chat", chatHandler)
	fmt.Println("Server running on :8080")
	http.ListenAndServe(":8080", nil)
}
```
You can test this with:
```shell
curl -X POST http://localhost:8080/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Hello! What can you do?"}'
```
The server maintains conversation history, so follow-up questions will include context.
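One limitation worth noting: a single global history means every client shares one conversation. A sketch of per-session storage keyed by a session ID (the `sessionStore` type and its methods are our own invention, not part of Ollama):

```go
package main

import (
	"fmt"
	"sync"
)

type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// sessionStore keeps an independent conversation per session ID.
type sessionStore struct {
	mu       sync.Mutex
	sessions map[string][]Message
}

func newSessionStore() *sessionStore {
	return &sessionStore{sessions: make(map[string][]Message)}
}

// Append adds a message to one session and returns a copy of its history.
func (s *sessionStore) Append(id string, m Message) []Message {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sessions[id] = append(s.sessions[id], m)
	return append([]Message{}, s.sessions[id]...)
}

func main() {
	store := newSessionStore()
	store.Append("alice", Message{Role: "user", Content: "Hi"})
	history := store.Append("bob", Message{Role: "user", Content: "Hello"})
	// Each session only sees its own messages.
	fmt.Println(len(history)) // 1
}
```

In the handler, you'd read the session ID from a header or cookie and call `store.Append` instead of touching the global slice.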
Wrapping Up
You now know how to work with Ollama from Go. We covered the basics of making requests, handling streaming responses, building a reusable client, and creating a practical API server.
From here, you could add features like model selection, persistent conversation storage, rate limiting, or integration with your existing Go applications. The possibilities are endless.
Remember to always handle timeouts and errors gracefully, and you’ll have a solid foundation for building AI-powered Go applications.
Happy coding!