Extracting a Substring in Go

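In Go, you extract a substring with slice notation, s[start:end], where start is inclusive and end is exclusive. The indexes are byte offsets, not character positions. Slicing is cheap because, in the standard implementation, the result shares the original string's bytes instead of copying them. For ASCII text every character is one byte, so byte offsets and character positions coincide. Go strings are conventionally UTF-8, however, and a non-ASCII character occupies two to four bytes, so a byte-based slice can cut a character in the middle and leave an invalid sequence.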
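
As a minimal sketch of the problem described above, the following program slices a string in the middle of a two-byte character and checks the result with the standard unicode/utf8 package. The helper onRuneBoundary is a hypothetical name introduced here for illustration; it is not part of the standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// onRuneBoundary reports whether byte offset i is a safe place to cut s,
// meaning it does not fall inside a multi-byte UTF-8 sequence.
// Hypothetical helper for illustration, not a standard library function.
func onRuneBoundary(s string, i int) bool {
	if i < 0 || i > len(s) {
		return false
	}
	return i == len(s) || utf8.RuneStart(s[i])
}

func main() {
	s := "h\u00e9llo" // "héllo": the 'é' occupies two bytes in UTF-8

	fmt.Println(len(s))                    // 6 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 5 (code points)

	broken := s[0:2]                      // ends in the middle of 'é'
	fmt.Println(utf8.ValidString(broken)) // false
	fmt.Println(onRuneBoundary(s, 2))     // false
	fmt.Println(onRuneBoundary(s, 3))     // true
}
```

A check like this can guard a byte-based slice when the offsets come from an untrusted source, though it only detects a bad cut and does not translate character positions into byte offsets.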

For Unicode-safe substring extraction, convert the string to a rune slice, slice that, and convert the result back to a string. Each element of a []rune is one Unicode code point, so every character counts as a single unit regardless of its byte length. The conversion copies the whole string, so it costs time and memory proportional to the string's length. Neither approach modifies the original, since strings in Go are immutable.

The following program shows both byte-based and Unicode-safe substring extraction:

package main

import (
  "fmt"
)

func main() {
  // Basic ASCII substring extraction
  text := "programming"
  fmt.Println(text[0:3])   // Output: pro
  fmt.Println(text[3:7])   // Output: gram
  fmt.Println(text[8:])    // Output: ing (from position 8 to end)
  fmt.Println(text[:4])    // Output: prog (from start to position 4)

  // Unicode-safe substring extraction
  greeting := "Hello 世界 🚀" // named greeting to avoid shadowing the unicode package
  runes := []rune(greeting)

  // Extract the first 7 characters (five letters, a space, and one Chinese character)
  substring := string(runes[0:7])
  fmt.Println(substring) // Output: Hello 世

  // Extract the character at rune index 7
  substring2 := string(runes[7:8])
  fmt.Println(substring2) // Output: 界

  // Helper function for safe substring extraction
  safeSubstring := func(s string, start, end int) string {
    runes := []rune(s)
    if start < 0 {
      start = 0
    }
    if end > len(runes) {
      end = len(runes)
    }
    if start > end {
      return ""
    }
    return string(runes[start:end])
  }

  fmt.Println(safeSubstring("Привет мир", 0, 6)) // Output: Привет
}

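Choosing between byte-based and rune-based substring extraction depends on your use case. For text known to be ASCII, byte slicing is the better choice: it runs in constant time and does not copy the string. For internationalized applications or user-generated content, use rune-based extraction so that no code point is ever split. Keep in mind that a rune is a single code point, and some user-perceived characters, such as a letter followed by a combining accent, consist of several code points.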
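
If the cost of the []rune conversion matters, one possible middle ground is to walk the string with a range loop, which yields the byte offset at which each rune starts, and then take an ordinary byte slice between two of those offsets. The sketch below assumes start and end are rune indexes with the same clamping behavior as safeSubstring above; the name substringRunes is hypothetical.

```go
package main

import "fmt"

// substringRunes returns the runes of s in the half-open range [start, end),
// with both positions counted in runes, without allocating a []rune.
// Hypothetical helper for illustration, not a standard library function.
func substringRunes(s string, start, end int) string {
	if start < 0 {
		start = 0
	}
	if end <= start {
		return ""
	}
	// Default both offsets to len(s) so out-of-range positions clamp to the end.
	startByte, endByte := len(s), len(s)
	n := 0 // index of the current rune
	for i := range s { // i is the byte offset where the current rune begins
		if n == start {
			startByte = i
		}
		if n == end {
			endByte = i
			break
		}
		n++
	}
	return s[startByte:endByte]
}

func main() {
	fmt.Println(substringRunes("Привет мир", 0, 6))  // Привет
	fmt.Println(substringRunes("Hello 世界 🚀", 6, 8)) // 世界
}
```

This version still scans the string from the beginning, so it is linear in the position of end, but it avoids the intermediate copy and returns a slice that shares the original string's bytes.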
