Building RAG Applications in Go

Elliot Forbes · Mar 7, 2026 · 10 min read

If you’ve been keeping up with the AI world lately, you’ve probably heard about Large Language Models (LLMs) being amazing but also having a bit of a problem: they can make stuff up. They’ll confidently tell you facts that are completely wrong. This is where RAG comes in, and it’s honestly one of the most practical ways to make LLMs actually useful for your specific domain.

In this guide, we’re going to build a RAG application in Go that can answer questions based on documents you provide. By the end, you’ll have a working system that retrieves relevant information from your documents and uses an LLM to generate accurate answers.

What is RAG and Why Should You Care?

RAG stands for Retrieval Augmented Generation. Instead of just asking an LLM a question and hoping it knows the answer, we give it the relevant documents first. It’s like handing someone a research paper before asking them to explain what’s in it.

Here’s why this matters:

Grounding in Reality: LLMs are trained on general knowledge, but they don’t know about your proprietary data, internal documentation, or domain-specific information. RAG solves this by letting you inject your own context into the conversation.

Reducing Hallucinations: When an LLM doesn’t have information, it makes things up (we call this hallucinating). By providing relevant documents, we dramatically reduce false information in the responses.

Cost Efficiency: Instead of fine-tuning a massive model on your data (expensive!), we just retrieve what’s relevant and include it in the prompt. This keeps costs down while maintaining quality.

How RAG Works: The Pipeline

RAG has a pretty straightforward pipeline:

  1. Document Loading: Read your documents (PDFs, text files, web pages, etc.)
  2. Splitting: Break documents into manageable chunks (LLMs have context limits)
  3. Embedding: Convert text chunks into vector representations
  4. Storage: Store these vectors in a vector database
  5. Retrieval: When a user asks a question, find similar chunks using vector similarity
  6. Generation: Pass the retrieved chunks + original question to an LLM to generate an answer

The magic happens because vectors let us find “similar” content efficiently. Embeddings capture meaning rather than exact wording, so even if your question uses different words than your documents, the closest vectors still point to the right chunks.
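To make that concrete, here’s a tiny standalone sketch. The three-dimensional vectors are made up for illustration (real embedding models produce hundreds of dimensions), but they show the key property: related concepts score close to 1, unrelated ones close to 0.

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Toy "embeddings" - similar concepts get similar vectors.
	dog := []float64{0.9, 0.1, 0.0}
	puppy := []float64{0.8, 0.2, 0.1}
	invoice := []float64{0.0, 0.1, 0.9}

	fmt.Printf("dog vs puppy:   %.2f\n", cosine(dog, puppy))   // 0.98
	fmt.Printf("dog vs invoice: %.2f\n", cosine(dog, invoice)) // 0.01
}
```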

Setting Up Your Project

Let’s start by creating a new Go project. We’ll use LangChainGo (a Go port of LangChain) and Ollama for running LLMs locally.

First, make sure you have Go installed, then create a new project:

mkdir rag-app && cd rag-app
go mod init github.com/yourusername/rag-app

Now let’s add our dependencies:

go get github.com/tmc/langchaingo

LangChainGo is a single Go module, so this one command covers the llms/ollama and embeddings packages we’ll import. We don’t need a separate vector store dependency - we’ll build our own small in-memory store later in this post.

You’ll also need Ollama installed and running. If you haven’t already, download it from ollama.ai. Once installed, pull a model:

ollama pull mistral
ollama pull nomic-embed-text

The first is our LLM; the second is our embedding model. You can use different models if you prefer - Ollama has lots available.

Loading and Splitting Documents

Let’s start with a simple function to load documents. For this example, we’ll work with text files, which keeps things straightforward:

package main

import (
	"fmt"
	"os"
	"strings"
	"unicode"
)

type Document struct {
	Content  string
	Metadata map[string]interface{}
}

func loadDocument(filePath string) (*Document, error) {
	// os.ReadFile replaces the deprecated ioutil.ReadFile (Go 1.16+)
	content, err := os.ReadFile(filePath)
	if err != nil {
		return nil, fmt.Errorf("failed to read file: %w", err)
	}

	return &Document{
		Content: string(content),
		Metadata: map[string]interface{}{
			"source": filePath,
		},
	}, nil
}

func splitDocument(doc *Document, chunkSize int, overlap int) []*Document {
	var chunks []*Document
	content := doc.Content

	// Split by sentences first for cleaner breaks
	sentences := splitBySentence(content)

	currentChunk := ""

	for _, sentence := range sentences {
		if len(currentChunk) + len(sentence) > chunkSize {
			if currentChunk != "" {
				chunks = append(chunks, &Document{
					Content:  strings.TrimSpace(currentChunk),
					Metadata: doc.Metadata,
				})
				// Create overlap by including some of the previous chunk
				words := strings.Fields(currentChunk)
				if len(words) > overlap {
					currentChunk = strings.Join(words[len(words)-overlap:], " ") + " "
				} else {
					currentChunk = ""
				}
			}
		}
		currentChunk += sentence + " "
	}

	// Add any remaining content
	if strings.TrimSpace(currentChunk) != "" {
		chunks = append(chunks, &Document{
			Content:  strings.TrimSpace(currentChunk),
			Metadata: doc.Metadata,
		})
	}

	return chunks
}

func splitBySentence(text string) []string {
	var sentences []string
	var current strings.Builder

	for i, char := range text {
		current.WriteRune(char)

		if (char == '.' || char == '!' || char == '?') && i+1 < len(text) {
			if unicode.IsSpace(rune(text[i+1])) {
				sentences = append(sentences, current.String())
				current.Reset()
			}
		}
	}

	if current.Len() > 0 {
		sentences = append(sentences, current.String())
	}

	return sentences
}

This code loads a text file and splits it into manageable chunks. Note the units: chunkSize is measured in characters, while overlap is measured in words. The overlap helps ensure context isn’t lost between chunks - each chunk repeats some words from the end of the previous one.
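If the character-and-sentence logic above feels dense, here’s a stripped-down, word-based variant that shows just the overlap mechanic in isolation. This is an illustrative sketch (it assumes overlap is smaller than size), not the splitter the app uses:

```go
package main

import (
	"fmt"
	"strings"
)

// chunkWords splits text into chunks of at most size words, where each
// chunk repeats the last overlap words of the previous one.
// Assumes 0 <= overlap < size.
func chunkWords(text string, size, overlap int) []string {
	words := strings.Fields(text)
	var chunks []string
	step := size - overlap
	for start := 0; start < len(words); start += step {
		end := start + size
		if end > len(words) {
			end = len(words)
		}
		chunks = append(chunks, strings.Join(words[start:end], " "))
		if end == len(words) {
			break
		}
	}
	return chunks
}

func main() {
	text := "one two three four five six seven eight nine ten"
	// size=4 words per chunk, overlap=1 word carried between chunks
	for i, c := range chunkWords(text, 4, 1) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```

Notice how “four” and “seven” appear at both the end of one chunk and the start of the next - that shared context is what the overlap buys you.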

Generating Embeddings

Now let’s create embeddings for our chunks. Embeddings are numerical representations of text that capture meaning:

package main

import (
	"context"

	"github.com/tmc/langchaingo/embeddings"
	"github.com/tmc/langchaingo/llms/ollama"
)

func generateEmbeddings(ctx context.Context, documents []*Document) ([][]float32, error) {
	// In LangChainGo, the Ollama client doubles as an embedding client:
	// create a handle for the embedding model, then wrap it in an embedder
	embedderLLM, err := ollama.New(
		ollama.WithModel("nomic-embed-text"),
		ollama.WithServerURL("http://localhost:11434"),
	)
	if err != nil {
		return nil, err
	}

	embedder, err := embeddings.NewEmbedder(embedderLLM)
	if err != nil {
		return nil, err
	}

	// Prepare texts for embedding
	texts := make([]string, len(documents))
	for i, doc := range documents {
		texts[i] = doc.Content
	}

	// Generate embeddings
	return embedder.EmbedDocuments(ctx, texts)
}

Each chunk of text gets converted into a vector (a list of numbers). The beauty of embeddings is that similar text gets similar vectors, which means we can use vector distance to find relevant chunks.

Storing Embeddings in a Vector Store

For this tutorial, we’ll use an in-memory vector store. In production, you’d use something like Pinecone, Weaviate, or Milvus, but for learning purposes, in-memory is perfect:

package main

import (
	"context"
	"math"
	"sort"
)

type VectorStore struct {
	vectors   [][]float32
	documents []*Document
	metadata  []map[string]interface{}
}

func NewVectorStore() *VectorStore {
	return &VectorStore{
		vectors:   [][]float32{},
		documents: []*Document{},
		metadata:  []map[string]interface{}{},
	}
}

func (vs *VectorStore) AddDocuments(documents []*Document, embeddings [][]float32) error {
	for i, doc := range documents {
		vs.documents = append(vs.documents, doc)
		vs.vectors = append(vs.vectors, embeddings[i])
		vs.metadata = append(vs.metadata, doc.Metadata)
	}
	return nil
}

func (vs *VectorStore) SimilaritySearch(ctx context.Context, queryVector []float32, k int) []*Document {
	type scoreDoc struct {
		score    float32
		document *Document
		index    int
	}

	var results []scoreDoc

	for i, vector := range vs.vectors {
		score := cosineSimilarity(queryVector, vector)
		results = append(results, scoreDoc{score, vs.documents[i], i})
	}

	// Sort by score descending
	sort.Slice(results, func(i, j int) bool {
		return results[i].score > results[j].score
	})

	// Return top k
	topK := k
	if topK > len(results) {
		topK = len(results)
	}

	var documents []*Document
	for i := 0; i < topK; i++ {
		documents = append(documents, results[i].document)
	}

	return documents
}

func cosineSimilarity(a, b []float32) float32 {
	var dotProduct float32
	var normA float32
	var normB float32

	for i := range a {
		dotProduct += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}

	if normA == 0 || normB == 0 {
		return 0
	}

	return dotProduct / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}

The SimilaritySearch function finds the chunks most similar to a query using cosine similarity, which measures the angle between two vectors while ignoring their magnitudes - two chunks about the same topic point in roughly the same direction.
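One design note: sorting every score works fine at this scale, but it’s O(n log n) per query. If you stick with an in-memory store as the chunk count grows, a bounded min-heap gets the top k in O(n log k) instead. Here’s a sketch of that alternative using the standard library’s container/heap (the types and function names are ours, not part of any library):

```go
package main

import (
	"container/heap"
	"fmt"
)

// scored pairs a similarity score with a document index.
type scored struct {
	score float32
	index int
}

// minHeap keeps the lowest score at the root so it can be evicted
// whenever a better candidate arrives.
type minHeap []scored

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].score < h[j].score }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(scored)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	item := old[len(old)-1]
	*h = old[:len(old)-1]
	return item
}

// topK returns the indices of the k highest scores, best first,
// in O(n log k) time with O(k) extra memory.
func topK(scores []float32, k int) []int {
	h := &minHeap{}
	for i, s := range scores {
		if h.Len() < k {
			heap.Push(h, scored{s, i})
		} else if s > (*h)[0].score {
			heap.Pop(h)
			heap.Push(h, scored{s, i})
		}
	}
	// Popping yields ascending order, so fill the result back-to-front.
	out := make([]int, h.Len())
	for i := len(out) - 1; i >= 0; i-- {
		out[i] = heap.Pop(h).(scored).index
	}
	return out
}

func main() {
	scores := []float32{0.12, 0.91, 0.33, 0.87, 0.05}
	fmt.Println(topK(scores, 3)) // [1 3 2]
}
```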

Putting It All Together

Now let’s create a complete RAG application that answers questions:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tmc/langchaingo/embeddings"
	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/ollama"
)

type RAGApp struct {
	vectorStore *VectorStore
	llm         llms.Model
	embedder    embeddings.Embedder
}

func NewRAGApp(ctx context.Context) (*RAGApp, error) {
	// Initialize the chat LLM
	llm, err := ollama.New(
		ollama.WithModel("mistral"),
		ollama.WithServerURL("http://localhost:11434"),
	)
	if err != nil {
		return nil, err
	}

	// Initialize the embedding model and wrap it in an embedder
	embedderLLM, err := ollama.New(
		ollama.WithModel("nomic-embed-text"),
		ollama.WithServerURL("http://localhost:11434"),
	)
	if err != nil {
		return nil, err
	}

	embedder, err := embeddings.NewEmbedder(embedderLLM)
	if err != nil {
		return nil, err
	}

	return &RAGApp{
		vectorStore: NewVectorStore(),
		llm:         llm,
		embedder:    embedder,
	}, nil
}

func (app *RAGApp) AddDocuments(ctx context.Context, documents []*Document) error {
	// Generate embeddings for all documents
	texts := make([]string, len(documents))
	for i, doc := range documents {
		texts[i] = doc.Content
	}

	embeddingResults, err := app.embedder.EmbedDocuments(ctx, texts)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}

	// Store documents and embeddings
	return app.vectorStore.AddDocuments(documents, embeddingResults)
}

func (app *RAGApp) Query(ctx context.Context, question string) (string, error) {
	// Generate embedding for the question
	questionEmbedding, err := app.embedder.EmbedQuery(ctx, question)
	if err != nil {
		return "", fmt.Errorf("failed to embed question: %w", err)
	}

	// Retrieve the 3 most relevant documents
	relevantDocs := app.vectorStore.SimilaritySearch(ctx, questionEmbedding, 3)

	// Build context from relevant documents (named contextText so it
	// doesn't shadow the context package)
	contextText := "Here is relevant information:\n\n"
	for i, doc := range relevantDocs {
		contextText += fmt.Sprintf("[Document %d]: %s\n\n", i+1, doc.Content)
	}

	// Create the prompt
	prompt := fmt.Sprintf(`%sQuestion: %s

Please answer the question based on the information provided above.`, contextText, question)

	// Call the LLM
	messages := []llms.MessageContent{
		{
			Role: llms.ChatMessageTypeHuman,
			Parts: []llms.ContentPart{
				llms.TextContent{Text: prompt},
			},
		},
	}

	response, err := app.llm.GenerateContent(ctx, messages)
	if err != nil {
		return "", fmt.Errorf("failed to generate response: %w", err)
	}

	if len(response.Choices) == 0 {
		return "", fmt.Errorf("no response from LLM")
	}

	// Each choice's Content is already a plain string
	return response.Choices[0].Content, nil
}

func main() {
	ctx := context.Background()

	// Initialize RAG app
	app, err := NewRAGApp(ctx)
	if err != nil {
		log.Fatalf("Failed to initialize RAG app: %v", err)
	}

	// Load and prepare documents
	doc, err := loadDocument("documents.txt")
	if err != nil {
		log.Fatalf("Failed to load document: %v", err)
	}

	chunks := splitDocument(doc, 500, 50)
	fmt.Printf("Split document into %d chunks\n", len(chunks))

	// Add documents to the RAG system
	err = app.AddDocuments(ctx, chunks)
	if err != nil {
		log.Fatalf("Failed to add documents: %v", err)
	}

	// Ask questions
	questions := []string{
		"What is the main topic of these documents?",
		"Can you explain the key concepts?",
	}

	for _, question := range questions {
		fmt.Printf("\nQ: %s\n", question)
		answer, err := app.Query(ctx, question)
		if err != nil {
			log.Printf("Error: %v", err)
			continue
		}
		fmt.Printf("A: %s\n", answer)
	}
}

This is your complete RAG system! It loads documents, splits them, generates embeddings, stores them, and answers questions based on the retrieved context.

Tips for Production

Before you take this to production, here are some things to think about:

Chunk Size and Overlap: The 500 character chunks with 50 word overlap we used work okay, but this depends on your domain. Try different sizes and measure what works best for your use case. Generally, 300-800 characters is a good starting point.

Embedding Model Selection: We used nomic-embed-text, which is lightweight and works well. If you need better quality, try mxbai-embed-large or use OpenAI embeddings. There’s always a tradeoff between quality and speed.

Vector Store Choice: In-memory is great for development, but for production with lots of documents, use a real vector database. Pinecone, Weaviate, and Milvus are all solid choices. They handle scaling, persistence, and complex queries much better.

Chunk Strategy: Instead of just splitting by character count, try splitting by semantic meaning or document structure. Keeping related information together in chunks improves retrieval quality.

Retrieval Strategy: Retrieving top-3 documents is fine for simple cases, but experiment with the number. More isn’t always better - too much context can confuse the LLM.

Prompt Engineering: The prompt we used is basic. Spend time refining how you format retrieved documents and instructions to the LLM. Small changes can have big impacts on quality.

Evaluation: In production, measure performance. Track how often the LLM answers correctly based on the retrieved documents. This gives you visibility into whether your RAG system is actually working.
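Even a crude automated check beats eyeballing answers. As one lightweight starting point (nowhere near a full evaluation framework, and the function below is our own invention), you can score each answer by how many expected keywords it contains:

```go
package main

import (
	"fmt"
	"strings"
)

// keywordRecall is a crude evaluation metric: the fraction of expected
// keywords that appear (case-insensitively) in the generated answer.
func keywordRecall(answer string, expected []string) float64 {
	if len(expected) == 0 {
		return 0
	}
	lower := strings.ToLower(answer)
	hits := 0
	for _, kw := range expected {
		if strings.Contains(lower, strings.ToLower(kw)) {
			hits++
		}
	}
	return float64(hits) / float64(len(expected))
}

func main() {
	answer := "RAG retrieves relevant chunks and feeds them to the LLM."
	score := keywordRecall(answer, []string{"retrieves", "chunks", "fine-tuning"})
	fmt.Printf("keyword recall: %.2f\n", score) // 2 of 3 keywords present
}
```

Run this over a small set of question/expected-keyword pairs after every change to chunk size, retrieval count, or prompt, and you’ll quickly see whether a tweak helped or hurt.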

What’s Next?

You now have the foundation for building RAG applications in Go! From here, you could:

  • Connect to a real vector database instead of in-memory storage
  • Support different document formats (PDFs, Word docs, etc.)
  • Add filters to limit which documents get searched
  • Implement conversation history so the LLM can handle follow-up questions
  • Build a web API around your RAG system
  • Experiment with different LLMs and embedding models

The RAG pattern is incredibly powerful because it keeps your AI grounded in reality while keeping costs down. Start simple like we did here, then add complexity as you need it.

Happy building!