Building RAG Applications in Go
If you’ve been keeping up with the AI world lately, you’ve probably heard about Large Language Models (LLMs) being amazing but also having a bit of a problem: they can make stuff up. They’ll confidently tell you facts that are completely wrong. This is where RAG comes in, and it’s honestly one of the most practical ways to make LLMs actually useful for your specific domain.
In this guide, we’re going to build a RAG application in Go that can answer questions based on documents you provide. By the end, you’ll have a working system that retrieves relevant information from your documents and uses an LLM to generate accurate answers.
What is RAG and Why Should You Care?
RAG stands for Retrieval-Augmented Generation. Instead of just asking an LLM a question and hoping it knows the answer, we give it the relevant documents first. It’s like handing someone a research paper before asking them to explain what’s in it.
Here’s why this matters:
Grounding in Reality: LLMs are trained on general knowledge, but they don’t know about your proprietary data, internal documentation, or domain-specific information. RAG solves this by letting you inject your own context into the conversation.
Reducing Hallucinations: When an LLM doesn’t have information, it makes things up (we call this hallucinating). By providing relevant documents, we dramatically reduce false information in the responses.
Cost Efficient: Instead of fine-tuning a massive model on your data (expensive!), we just retrieve what’s relevant and include it in the prompt. This keeps costs down while maintaining quality.
How RAG Works: The Pipeline
RAG has a pretty straightforward pipeline:
- Document Loading: Read your documents (PDFs, text files, web pages, etc.)
- Splitting: Break documents into manageable chunks (LLMs have context limits)
- Embedding: Convert text chunks into vector representations
- Storage: Store these vectors in a vector database
- Retrieval: When a user asks a question, find similar chunks using vector similarity
- Generation: Pass the retrieved chunks + original question to an LLM to generate an answer
The magic happens because vectors let us find semantically similar content efficiently. Even if your question uses different words than your documents, their embeddings end up close together in vector space, so the match still works.
Setting Up Your Project
Let’s start by creating a new Go project. We’ll use LangChainGo (a Go port of LangChain) and Ollama for running LLMs locally.
First, make sure you have Go installed, then create a new project:
mkdir rag-app && cd rag-app
go mod init github.com/yourusername/rag-app
Now let’s add our dependencies:
go get github.com/tmc/langchaingo
This single module provides the LLM clients, embeddings helpers, and everything else we need; we’ll build our own in-memory vector store by hand.
You’ll also need Ollama installed and running. If you haven’t already, download it from ollama.ai. Once installed, pull a model:
ollama pull mistral
ollama pull nomic-embed-text
The first is our LLM, the second is our embedding model. You can use different models if you prefer - Ollama has lots available.
Loading and Splitting Documents
Let’s start with a simple function to load documents. For this example, we’ll work with text files, which keeps things straightforward:
package main

import (
	"fmt"
	"os"
	"strings"
	"unicode"
)

type Document struct {
	Content  string
	Metadata map[string]interface{}
}

func loadDocument(filePath string) (*Document, error) {
	content, err := os.ReadFile(filePath) // ioutil.ReadFile is deprecated
	if err != nil {
		return nil, fmt.Errorf("failed to read file: %w", err)
	}
	return &Document{
		Content: string(content),
		Metadata: map[string]interface{}{
			"source": filePath,
		},
	}, nil
}
func splitDocument(doc *Document, chunkSize int, overlap int) []*Document {
	var chunks []*Document
	// Split by sentences first for cleaner breaks.
	sentences := splitBySentence(doc.Content)
	currentChunk := ""
	for _, sentence := range sentences {
		if len(currentChunk)+len(sentence) > chunkSize && currentChunk != "" {
			chunks = append(chunks, &Document{
				Content:  strings.TrimSpace(currentChunk),
				Metadata: doc.Metadata,
			})
			// Create overlap by carrying the last few words of the
			// previous chunk into the next one.
			words := strings.Fields(currentChunk)
			if len(words) > overlap {
				currentChunk = strings.Join(words[len(words)-overlap:], " ") + " "
			} else {
				currentChunk = ""
			}
		}
		currentChunk += sentence + " "
	}
	// Add any remaining content.
	if strings.TrimSpace(currentChunk) != "" {
		chunks = append(chunks, &Document{
			Content:  strings.TrimSpace(currentChunk),
			Metadata: doc.Metadata,
		})
	}
	return chunks
}
func splitBySentence(text string) []string {
	var sentences []string
	var current strings.Builder
	for i, char := range text {
		current.WriteRune(char)
		// End a sentence at '.', '!' or '?' followed by whitespace.
		if (char == '.' || char == '!' || char == '?') && i+1 < len(text) &&
			unicode.IsSpace(rune(text[i+1])) {
			sentences = append(sentences, current.String())
			current.Reset()
		}
	}
	if current.Len() > 0 {
		sentences = append(sentences, current.String())
	}
	return sentences
}
This code loads a text file and splits it into manageable chunks. The overlap parameter helps ensure context isn’t lost between chunks - each chunk includes some words from the previous one.
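The overlap mechanics boil down to slicing off the trailing words of the finished chunk, exactly as the `words[len(words)-overlap:]` expression above does. Here it is in isolation (the `lastWords` helper name is mine, for illustration only):

```go
package main

import (
	"fmt"
	"strings"
)

// lastWords returns the trailing n words of s, mirroring the overlap step
// in splitDocument: these words become the start of the next chunk.
func lastWords(s string, n int) string {
	words := strings.Fields(s)
	if len(words) <= n {
		return s
	}
	return strings.Join(words[len(words)-n:], " ")
}

func main() {
	chunk := "Go makes concurrency simple with goroutines and channels."
	// The next chunk would begin with the last 3 words of this one.
	fmt.Println(lastWords(chunk, 3)) // prints "goroutines and channels."
}
```

Because the seed words repeat across chunk boundaries, a sentence that straddles two chunks is still retrievable from at least one of them.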
Generating Embeddings
Now let’s create embeddings for our chunks. Embeddings are numerical representations of text that capture meaning:
package main

import (
	"context"

	"github.com/tmc/langchaingo/embeddings"
	"github.com/tmc/langchaingo/llms/ollama"
)

func generateEmbeddings(ctx context.Context, documents []*Document) ([][]float32, error) {
	// An Ollama client pointed at an embedding model serves as the backend
	// for langchaingo's embeddings package.
	embLLM, err := ollama.New(
		ollama.WithModel("nomic-embed-text"),
		ollama.WithServerURL("http://localhost:11434"),
	)
	if err != nil {
		return nil, err
	}
	embedder, err := embeddings.NewEmbedder(embLLM)
	if err != nil {
		return nil, err
	}
	// Prepare texts for embedding.
	texts := make([]string, len(documents))
	for i, doc := range documents {
		texts[i] = doc.Content
	}
	// Generate one vector per chunk.
	return embedder.EmbedDocuments(ctx, texts)
}
Each chunk of text gets converted into a vector (a list of numbers). The beauty of embeddings is that similar text gets similar vectors, which means we can use vector distance to find relevant chunks.
Storing Embeddings in a Vector Store
For this tutorial, we’ll use an in-memory vector store. In production, you’d use something like Pinecone, Weaviate, or Milvus, but for learning purposes, in-memory is perfect:
package main

import (
	"context"
	"math"
	"sort"
)

type VectorStore struct {
	vectors   [][]float32
	documents []*Document
}

func NewVectorStore() *VectorStore {
	return &VectorStore{}
}

func (vs *VectorStore) AddDocuments(documents []*Document, embeddings [][]float32) error {
	for i, doc := range documents {
		vs.documents = append(vs.documents, doc)
		vs.vectors = append(vs.vectors, embeddings[i])
	}
	return nil
}

func (vs *VectorStore) SimilaritySearch(ctx context.Context, queryVector []float32, k int) []*Document {
	type scoredDoc struct {
		score    float32
		document *Document
	}
	results := make([]scoredDoc, 0, len(vs.vectors))
	for i, vector := range vs.vectors {
		results = append(results, scoredDoc{cosineSimilarity(queryVector, vector), vs.documents[i]})
	}
	// Sort by score, highest first.
	sort.Slice(results, func(i, j int) bool {
		return results[i].score > results[j].score
	})
	// Return the top k documents.
	if k > len(results) {
		k = len(results)
	}
	documents := make([]*Document, 0, k)
	for i := 0; i < k; i++ {
		documents = append(documents, results[i].document)
	}
	return documents
}

func cosineSimilarity(a, b []float32) float32 {
	var dotProduct, normA, normB float32
	for i := range a {
		dotProduct += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dotProduct / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}
The SimilaritySearch function finds the chunks most similar to a query using cosine similarity, which scores two vectors by the angle between them: 1 means they point the same way, 0 means they are unrelated.
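To build some intuition, here is a tiny standalone check of those two extremes (the cosineSimilarity function is repeated so the snippet compiles on its own):

```go
package main

import (
	"fmt"
	"math"
)

// cosineSimilarity, repeated here so this snippet is self-contained.
func cosineSimilarity(a, b []float32) float32 {
	var dotProduct, normA, normB float32
	for i := range a {
		dotProduct += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0
	}
	return dotProduct / (float32(math.Sqrt(float64(normA))) * float32(math.Sqrt(float64(normB))))
}

func main() {
	// Identical direction: maximum similarity.
	fmt.Println(cosineSimilarity([]float32{1, 0}, []float32{1, 0})) // prints 1
	// Perpendicular vectors: no similarity.
	fmt.Println(cosineSimilarity([]float32{1, 0}, []float32{0, 1})) // prints 0
}
```

Real embedding vectors have hundreds of dimensions rather than two, but the scoring works exactly the same way.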
Putting It All Together
Now let’s create a complete RAG application that answers questions:
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/tmc/langchaingo/embeddings"
	"github.com/tmc/langchaingo/llms"
	"github.com/tmc/langchaingo/llms/ollama"
)

type RAGApp struct {
	vectorStore *VectorStore
	llm         llms.Model
	embedder    embeddings.Embedder
}

func NewRAGApp(ctx context.Context) (*RAGApp, error) {
	// Initialize the chat LLM.
	llm, err := ollama.New(
		ollama.WithModel("mistral"),
		ollama.WithServerURL("http://localhost:11434"),
	)
	if err != nil {
		return nil, err
	}
	// Initialize the embedder: a second Ollama client, pointed at the
	// embedding model, serves as its backend.
	embLLM, err := ollama.New(
		ollama.WithModel("nomic-embed-text"),
		ollama.WithServerURL("http://localhost:11434"),
	)
	if err != nil {
		return nil, err
	}
	embedder, err := embeddings.NewEmbedder(embLLM)
	if err != nil {
		return nil, err
	}
	return &RAGApp{
		vectorStore: NewVectorStore(),
		llm:         llm,
		embedder:    embedder,
	}, nil
}
func (app *RAGApp) AddDocuments(ctx context.Context, documents []*Document) error {
	// Generate embeddings for all chunks in one batch.
	texts := make([]string, len(documents))
	for i, doc := range documents {
		texts[i] = doc.Content
	}
	embeddingResults, err := app.embedder.EmbedDocuments(ctx, texts)
	if err != nil {
		return fmt.Errorf("failed to generate embeddings: %w", err)
	}
	// Store documents alongside their embeddings.
	return app.vectorStore.AddDocuments(documents, embeddingResults)
}

func (app *RAGApp) Query(ctx context.Context, question string) (string, error) {
	// Embed the question with the same model used for the documents.
	questionEmbedding, err := app.embedder.EmbedQuery(ctx, question)
	if err != nil {
		return "", fmt.Errorf("failed to embed question: %w", err)
	}
	// Retrieve the most relevant chunks.
	relevantDocs := app.vectorStore.SimilaritySearch(ctx, questionEmbedding, 3)
	// Build the context block from the retrieved chunks. (Named docContext
	// to avoid shadowing the context package.)
	docContext := "Here is relevant information:\n\n"
	for i, doc := range relevantDocs {
		docContext += fmt.Sprintf("[Document %d]: %s\n\n", i+1, doc.Content)
	}
	// Create the prompt.
	prompt := fmt.Sprintf(`%sQuestion: %s

Please answer the question based on the information provided above.`, docContext, question)
	// Call the LLM.
	messages := []llms.MessageContent{
		{
			Role: llms.ChatMessageTypeHuman,
			Parts: []llms.ContentPart{
				llms.TextContent{Text: prompt},
			},
		},
	}
	response, err := app.llm.GenerateContent(ctx, messages)
	if err != nil {
		return "", fmt.Errorf("failed to generate response: %w", err)
	}
	if len(response.Choices) == 0 {
		return "", fmt.Errorf("no response from LLM")
	}
	// Choices[0].Content is the generated text.
	return response.Choices[0].Content, nil
}
func main() {
	ctx := context.Background()
	// Initialize the RAG app.
	app, err := NewRAGApp(ctx)
	if err != nil {
		log.Fatalf("Failed to initialize RAG app: %v", err)
	}
	// Load and prepare documents.
	doc, err := loadDocument("documents.txt")
	if err != nil {
		log.Fatalf("Failed to load document: %v", err)
	}
	chunks := splitDocument(doc, 500, 50)
	fmt.Printf("Split document into %d chunks\n", len(chunks))
	// Add documents to the RAG system.
	if err := app.AddDocuments(ctx, chunks); err != nil {
		log.Fatalf("Failed to add documents: %v", err)
	}
	// Ask questions.
	questions := []string{
		"What is the main topic of these documents?",
		"Can you explain the key concepts?",
	}
	for _, question := range questions {
		fmt.Printf("\nQ: %s\n", question)
		answer, err := app.Query(ctx, question)
		if err != nil {
			log.Printf("Error: %v", err)
			continue
		}
		fmt.Printf("A: %s\n", answer)
	}
}
This is your complete RAG system! It loads documents, splits them, generates embeddings, stores them, and answers questions based on the retrieved context.
Tips for Production
Before you take this to production, here are some things to think about:
Chunk Size and Overlap: The 500-character chunks with 50-word overlap we used work okay, but the best values depend on your domain. Try different sizes and measure what works best for your use case. Generally, 300-800 characters is a good starting point.
Embedding Model Selection: We used nomic-embed-text, which is lightweight and works well. If you need better quality, try mxbai-embed-large or use OpenAI embeddings. There’s always a tradeoff between quality and speed.
Vector Store Choice: In-memory is great for development, but for production with lots of documents, use a real vector database. Pinecone, Weaviate, and Milvus are all solid choices. They handle scaling, persistence, and complex queries much better.
Chunk Strategy: Instead of just splitting by character count, try splitting by semantic meaning or document structure. Keeping related information together in chunks improves retrieval quality.
Retrieval Strategy: Retrieving top-3 documents is fine for simple cases, but experiment with the number. More isn’t always better - too much context can confuse the LLM.
Prompt Engineering: The prompt we used is basic. Spend time refining how you format retrieved documents and instructions to the LLM. Small changes can have big impacts on quality.
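As one illustration of that refinement (the wording here is entirely my own, not from any library), a more defensive template can ask the model to cite document numbers and to admit when the context doesn’t contain the answer:

```go
package main

import "fmt"

// A hypothetical, more defensive prompt template: it restricts the model to
// the provided documents, asks for citations, and gives it an explicit way
// to decline rather than hallucinate.
const ragPrompt = `You are a helpful assistant. Answer the question using ONLY
the documents below. Cite document numbers like [1]. If the documents do not
contain the answer, say "I don't know based on the provided documents."

%s

Question: %s
Answer:`

func buildPrompt(docContext, question string) string {
	return fmt.Sprintf(ragPrompt, docContext, question)
}

func main() {
	fmt.Println(buildPrompt("[Document 1]: Go was announced in 2009.", "When was Go announced?"))
}
```

Swapping a template like this into the Query method is a one-line change, which makes prompt variants cheap to A/B test.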
Evaluation: In production, measure performance. Track how often the LLM answers correctly based on the retrieved documents. This gives you visibility into whether your RAG system is actually working.
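One cheap retrieval metric to start with is hit rate: the fraction of test questions whose known source text actually shows up in the retrieved chunks. A toy sketch (all names and data here are hypothetical, and the stub stands in for the vector store’s SimilaritySearch):

```go
package main

import (
	"fmt"
	"strings"
)

type evalCase struct {
	question string
	expected string // substring that should appear in some retrieved chunk
}

// hitRate reports the fraction of cases whose expected text appears in the
// chunks returned by retrieve.
func hitRate(cases []evalCase, retrieve func(string) []string) float64 {
	hits := 0
	for _, c := range cases {
		for _, chunk := range retrieve(c.question) {
			if strings.Contains(chunk, c.expected) {
				hits++
				break
			}
		}
	}
	return float64(hits) / float64(len(cases))
}

func main() {
	// Stub retriever standing in for a real similarity search.
	retrieve := func(q string) []string {
		return []string{"Go was announced in 2009.", "Gophers love channels."}
	}
	cases := []evalCase{
		{"When was Go announced?", "2009"},
		{"What is a monad?", "monad"}, // not in this corpus: a miss
	}
	fmt.Println(hitRate(cases, retrieve)) // prints 0.5
}
```

Tracking this number while you tune chunk size, overlap, and k tells you whether retrieval is improving independently of the LLM’s phrasing.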
What’s Next?
You now have the foundation for building RAG applications in Go! From here, you could:
- Connect to a real vector database instead of in-memory storage
- Support different document formats (PDFs, Word docs, etc.)
- Add filters to limit which documents get searched
- Implement conversation history so the LLM can handle follow-up questions
- Build a web API around your RAG system
- Experiment with different LLMs and embedding models
The RAG pattern is incredibly powerful because it keeps your AI grounded in reality while keeping costs down. Start simple like we did here, then add complexity as you need it.
Happy building!