Odin · strings & runes

A string, four levels deep

A string is a 16-byte header (on a 64-bit target) that points at a run of UTF-8 bytes: a data pointer plus a byte count. The header is not the text; it records where the text lives and how many bytes long it is. The bytes themselves are read-only.

Real bytes & sizes, measured from a compiled Odin program. The data pointer differs every run, so it's shown as one representative run; the length and the UTF-8 bytes are stable.

s : string (16 bytes)

data → 0x… (8 bytes) → the UTF-8 byte buffer
len = … (8 bytes)

Sizes (size_of)

string  = 16 (ptr 8 + len 8)
rune    = 4 (one code point)
u8      = 1 (one byte)

The 16 is fixed no matter how long the text is — only the buffer it points at grows.
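The sizes above can be checked with a short probe. This is a minimal sketch that assumes a 64-bit target; the variable names are illustrative, and the printed pointer will differ from run to run.

```odin
package main

import "core:fmt"

main :: proc() {
	short := "A"
	long  := "Aé猫🐈"

	// The header size is a property of the type, not of the text.
	fmt.println("size_of(string) =", size_of(string)) // 16 on a 64-bit target
	fmt.println("size_of(rune)   =", size_of(rune))   // 4
	fmt.println("size_of(u8)     =", size_of(u8))     // 1

	// Same header size for both values; only len and the buffer differ.
	fmt.println(size_of(short), len(short)) // 16 1
	fmt.println(size_of(long), len(long))   // 16 10

	// raw_data exposes the data pointer stored in the header.
	fmt.println(raw_data(long))
}
```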

What you're seeing

Level 1 was what a string holds. Level 2 is the consequence that bites: the length is a count of bytes, and one character can be several bytes. So len(s) is not the number of characters, and stepping through a string by index is stepping through bytes.

In "Aé猫🐈" the four characters take 1, 2, 3, and 4 bytes: ten bytes for four characters. len(s) is 10; the actual character count (utf8.rune_count) is 4. The decoder reads one character at a time, and a rune, Odin's 4-byte type for a single Unicode code point, is what you get for each. As the loop steps, the byte index jumps by the width of the character just read.

The mechanism: a for r, i in s loop is a decoder, not a counter. Each step reads one UTF-8 character, hands you the decoded rune in r and the byte offset where it started in i, then advances i by however many bytes that character occupied: 1 for A, 2 for é, 3 for 猫, 4 for 🐈. So i goes 0 → 1 → 3 → 6, never landing on 2, 4, 5, 7, 8, or 9. Those skipped indices are the continuation bytes in the middle of a character. The loop knows the width from the leading byte's bit pattern, so it always lands on a character boundary.
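The walk described above can be reproduced with a few lines. This is a sketch only; rune_starts is a hypothetical helper introduced here to collect the offsets, not something from the lesson or from core.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

// rune_starts (hypothetical helper) collects the byte offset at which each
// decoded rune begins. The caller owns the returned array.
rune_starts :: proc(s: string, allocator := context.allocator) -> [dynamic]int {
	starts := make([dynamic]int, allocator)
	for _, i in s {
		append(&starts, i)
	}
	return starts
}

main :: proc() {
	s := "Aé猫🐈"

	// Byte count versus character count.
	fmt.println("bytes:", len(s), "runes:", utf8.rune_count(s)) // 10 and 4

	// r is the decoded rune, i is the byte offset where it started.
	for r, i in s {
		fmt.printf("i=%d r=%c\n", i, r)
	}

	starts := rune_starts(s)
	defer delete(starts)
	fmt.println(starts) // offsets 0, 1, 3, 6
}
```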

the trap: s[i] is a byte, not a character

Indexing reaches a single byte, typed u8: s[i] is "give me byte i", never "give me character i". For "café" (5 bytes), the é lives in bytes 3 and 4 as the pair 195, 169; index either one and you get half of a character, a number that is no letter at all. When you want characters, you decode (the for r, i in s loop, or utf8.rune_at); when you want raw storage, you index. They are different questions with different answers.
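To make the two questions concrete, here is a small sketch that asks both of the same string. It uses utf8.rune_at and utf8.decode_rune_in_string from core:unicode/utf8; the variable names are illustrative.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

main :: proc() {
	s := "café" // 5 bytes: 63 61 66 c3 a9

	// Indexing: raw storage, one u8 at a time.
	b: u8 = s[3]
	fmt.println(b)    // 195, the lead byte of é
	fmt.println(s[4]) // 169, the continuation byte

	// Decoding: start at a character boundary and read a whole rune.
	r := utf8.rune_at(s, 3)
	fmt.println(r) // é

	// decode_rune_in_string also reports how many bytes were consumed.
	r2, width := utf8.decode_rune_in_string(s[3:])
	fmt.println(r2, width) // é 2
}
```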

Level 2 told you indexing is by byte. Level 3 is the bill that follows: because a slice is a byte range, it can cut a character in half, and because the bytes are read-only, you cannot edit a string in place. Two sharp edges: the first is silent, the second is a compile error.

Edge 1 — a byte slice can land mid-character. Slicing is s[lo:hi] by byte offset. Slice "café"[0:4] and you keep c a f plus the first half of é — a dangling lead byte 0xc3 with its partner left behind. That orphan is not a valid character; decoding it gives U+FFFD, the replacement character. The slice is silent about it — the cut is lossy and nothing warns you:

s := "café"           // 5 bytes: 63 61 66 c3 a9   (é is c3 a9)
cut := s[0:4]         // keeps 63 61 66 c3 — half of é

len(cut) = 4
last byte of cut = 0xc3   // a lead byte with no continuation
decode of the orphan: U+FFFD  valid=false

The fix is to cut on boundaries you found by decoding — the byte index i from a for r, i in s loop is always a valid character boundary, so slicing at one of those never splits a character.
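One way to apply that fix is a helper that finds the last boundary at or before a byte limit. The sketch below is an assumption about how one might write it: truncate_on_boundary is a hypothetical name, not a core procedure.

```odin
package main

import "core:fmt"
import "core:unicode/utf8"

// truncate_on_boundary (hypothetical helper) returns the longest prefix of s
// that is at most max_bytes long and ends on a character boundary.
truncate_on_boundary :: proc(s: string, max_bytes: int) -> string {
	if max_bytes >= len(s) {
		return s
	}
	end := 0
	for _, i in s {
		if i > max_bytes {
			break
		}
		end = i // i is always the first byte of a character
	}
	return s[:end]
}

main :: proc() {
	s := "café"

	// The naive byte cut keeps the orphaned lead byte 0xc3.
	naive := s[0:4]
	fmt.println(len(naive), utf8.valid_string(naive)) // 4 false

	// The boundary-aware cut backs off to the start of é.
	safe := truncate_on_boundary(s, 4)
	fmt.println(safe, len(safe), utf8.valid_string(safe)) // caf 3 true
}
```

The helper returns a view into the same buffer, so it allocates nothing; it gives up one byte of the budget rather than emitting half a character.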

Edge 2 — the bytes are read-only. A string is a view; you cannot write through it. Try to overwrite a byte and the build stops before it ever runs:

s := "héllo"
s[0] = 'H'

Error: Cannot assign to 's[0]'

The fix: to change text you build a new buffer. strings.concatenate, strings.clone, and the builder in core:strings each allocate a fresh run of bytes through context.allocator. The first two hand you back a string pointing at it, which you then delete when done; a builder is released with strings.builder_destroy (lesson 07b's defer pairs with this perfectly). The original is never mutated; you produce a replacement.
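As an illustration of building a replacement instead of editing in place, the sketch below produces "Héllo" from "héllo" two ways. It assumes the default context.allocator and relies on s[1:] being a valid boundary because h is one byte wide.

```odin
package main

import "core:fmt"
import "core:strings"

main :: proc() {
	s := "héllo"

	// Option 1: concatenate a new first character with the rest of s.
	t := strings.concatenate([]string{"H", s[1:]})
	defer delete(t) // t owns a freshly allocated buffer
	fmt.println(t) // Héllo
	fmt.println(s) // héllo, the original is untouched

	// Option 2: a builder, useful when the text is assembled in several steps.
	b := strings.builder_make()
	defer strings.builder_destroy(&b)
	strings.write_rune(&b, 'H')
	strings.write_string(&b, s[1:])
	fmt.println(strings.to_string(b)) // Héllo (a view into the builder's buffer)
}
```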

the quiet edge: the bytes are owned by someone

A string's header points at a buffer it does not own: the bytes might live in the program binary (a literal), in an allocator's memory, or inside another buffer. The string carries no lifetime; if you hold one past the point its buffer is freed or reused, the pointer dangles and you read whatever is there now. The safe move when a string must outlive its source is to strings.clone a copy you own and defer delete it.
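The "reused" case can be shown without touching freed memory. In this sketch (names are illustrative) a string is a view over a byte slice that is later overwritten, while a clone taken beforehand keeps the old text.

```odin
package main

import "core:fmt"
import "core:strings"

main :: proc() {
	// A buffer owned by this procedure.
	buf := make([]u8, 5)
	defer delete(buf)
	copy(buf, "hello")

	view := string(buf)          // a view over buf: same bytes, no copy
	owned := strings.clone(view) // a separate buffer that this code owns
	defer delete(owned)

	// The buffer is reused for something else.
	buf[0] = 'j'

	fmt.println(view)  // jello: the view sees whatever the buffer holds now
	fmt.println(owned) // hello: the clone is unaffected
}
```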

The last level is the payoff: because the header carries the length, the string already knows where it ends. That one stored number is what makes correct text handling cheap, and it is why the boundary type, cstring, is the rare exception rather than the rule.

Carrying the byte count in the header means three things come for free. One: the end is known without scanning — no walking the bytes looking for a terminator, because the length is right there. Two: a slice s[lo:hi] is just a new (pointer, length) pair aimed into the same buffer — no copy, and the sub-string knows its own length too. Three: the for r, i in s decoder uses that length as its hard stop, so iteration is bounded and lands on every character boundary correctly, including the multi-byte ones from Level 2.
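The second point, that a slice is only a new (pointer, length) pair, can be observed by comparing data pointers. A minimal sketch, with the offsets taken from the Level 2 example:

```odin
package main

import "core:fmt"

main :: proc() {
	s := "Aé猫🐈"

	// Bytes 3..6 are the three bytes of 猫; both ends are character boundaries.
	sub := s[3:6]

	// The sub-string's pointer is the parent's pointer plus the low offset.
	offset := uintptr(raw_data(sub)) - uintptr(raw_data(s))

	fmt.println(sub)      // 猫
	fmt.println(len(sub)) // 3, stored in the sub-string's own header
	fmt.println(offset)   // 3, so no bytes were copied
}
```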

The one place the length is not there is cstring — the NUL-terminated variant you use only at a foreign-library boundary. It is 8 bytes, just a pointer: half the size of a string, because it drops the length word and instead marks the end with a trailing NUL byte. The cost shows up the moment you ask its length — there's no stored count, so len has to scan forward to the NUL to find it:

the two string types, to scale (size_of)

string  = 16 bytes (ptr 8 + len 8)
cstring = 8 bytes (ptr only)

len(c) on a cstring scans to the NUL → 6 for "héllo"; converted back to a real string, the length (6) sits in the header again, so no scan is needed.
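The round trip can be sketched as follows. It assumes a 64-bit target and uses strings.clone_to_cstring to get a NUL-terminated copy; in real code the cstring would normally be handed to a foreign procedure rather than measured.

```odin
package main

import "core:fmt"
import "core:strings"

main :: proc() {
	s := "héllo" // 6 bytes

	// Allocate a NUL-terminated copy for a foreign boundary.
	c := strings.clone_to_cstring(s)
	defer delete(c)

	fmt.println(size_of(string), size_of(cstring)) // 16 8 on a 64-bit target

	// No stored count: len walks the bytes until it reaches the NUL.
	fmt.println(len(c)) // 6

	// Converting back measures once and stores the length in the header.
	back := string(c)
	fmt.println(len(back), back == s) // 6 true
}
```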

The emergent payoff: the stored length is the whole reason a string in Odin is a safe, cheap value to pass around. You hand a procedure a string and it knows the exact extent of the text — it can iterate it, slice it, and never run off the end, all without scanning and without copying. The terminator-based form still exists for talking to outside libraries that expect it, but it is the special case you reach for deliberately, not the default — the default already knows how long it is.

That's the arc: L1 a string is a 16-byte header — a pointer plus a byte count — over read-only UTF-8 → L2 that count is bytes, and a character can span several, so iteration decodes and the index jumps → L3 the bill: a byte slice can cut a character in half, and the bytes can't be edited in place → L4 the stored length is the payoff — the end is known, slices are free, iteration is bounded; cstring is the 8-byte terminator-based exception for foreign boundaries.

probes reproduce with odin run · sizes, bytes, the rune walk & the L3 errors are real compiler output (claims/lessons/04-strings-and-runes)