A reference frame is a previously decoded frame that the codec uses to predict frames decoded after it. When a P-frame says "this block looks like a block from 200 ms ago, shifted slightly to the left", the frame from 200 ms ago is the reference. Reference frames live in the decoder's memory, in the Decoded Picture Buffer (DPB), for as long as later frames might need them, and are then discarded to make room for new ones.
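The buffer behaviour described above can be sketched as a simple sliding window. This is a toy model under stated assumptions, not how any particular decoder is implemented: the class name `SlidingWindowDpb` and its methods are hypothetical, frames are represented by integer ids, and real codecs also support explicit marking and long-term references that this sketch ignores.

```python
from collections import deque


class SlidingWindowDpb:
    """Toy decoded picture buffer: keeps only the most recent `capacity` frames."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        # A deque with maxlen drops its oldest entry automatically on append.
        self.frames = deque(maxlen=capacity)

    def store(self, frame_id):
        """Add a newly decoded frame; return the evicted frame id, or None."""
        is_full = len(self.frames) == self.frames.maxlen
        evicted = self.frames[0] if is_full else None
        self.frames.append(frame_id)
        return evicted

    def available_references(self):
        """Frames that a later frame could still predict from, oldest first."""
        return list(self.frames)


dpb = SlidingWindowDpb(capacity=3)
evictions = [dpb.store(n) for n in range(5)]
print(evictions)                   # [None, None, None, 0, 1]
print(dpb.available_references())  # [2, 3, 4]
```

With room for three frames, decoding frames 0 through 4 evicts frames 0 and 1, so a block in frame 5 could only refer back to frames 2, 3, or 4.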
Modern codecs keep several reference frames available at once: H.264 allows up to 16 (the level and resolution set the actual limit), HEVC is similar, and AV1 keeps 8 reference slots, of which a single frame can use up to 7. The encoder picks whichever one (or two, for B-frames) best matches the block it is encoding. More candidates mean better predictions, smaller residuals, and smaller files. The trade-off is that every reference frame kept active consumes decoder memory, which is why codec levels cap the number of references at each resolution tier.
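To make the level cap concrete, here is a minimal sketch of how the H.264 limit works out for a few resolutions. It assumes progressive video and uses the rule that the DPB may hold at most MaxDpbMbs macroblocks' worth of frames, capped at 16. The `MAX_DPB_MBS` values are a small subset recalled from Table A-1 of the H.264 specification and should be checked against the spec before being relied on; the function name `max_dpb_frames` is hypothetical.

```python
# MaxDpbMbs per level: a subset recalled from H.264 Table A-1 (verify before use).
MAX_DPB_MBS = {"3.1": 18000, "4.0": 32768, "5.1": 184320}


def max_dpb_frames(width, height, level):
    """Largest number of frames the DPB may hold at this resolution and level."""
    # Frame size in 16x16 macroblocks, rounding partial macroblocks up.
    mbs_wide = -(-width // 16)
    mbs_high = -(-height // 16)
    frame_mbs = mbs_wide * mbs_high
    # The level bounds total macroblocks in the buffer; 16 is the hard ceiling.
    return min(MAX_DPB_MBS[level] // frame_mbs, 16)


print(max_dpb_frames(1280, 720, "3.1"))   # 5
print(max_dpb_frames(1920, 1080, "4.0"))  # 4
print(max_dpb_frames(1920, 1080, "5.1"))  # 16
print(max_dpb_frames(3840, 2160, "5.1"))  # 5
```

The same level that permits 16 frames at 1080p allows only 5 at 4K, which is the memory trade-off in practice: the larger each frame, the fewer the decoder is required to hold.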
For a product team, reference frames offer two practical insights. First, slower encoder presets such as slow and veryslow search more reference frames and find better matches; that is part of why they compress better. Second, much of a codec's "intelligence" lives in reference frame selection: picking the right reference from a pool of up to 16 candidates is a key part of mode decision, and it is where an encoder's heuristics (traditional psycho-visual tuning or newer machine-learned models) do better than a fixed rule such as "always use the previous frame". If a video pipeline starts producing smaller files at the same VMAF after an encoder update, better reference frame selection is one likely cause.
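As a rough illustration of what "picking the right reference" means, the sketch below scores each candidate block by its sum of absolute differences (SAD) against the block being encoded and keeps the cheapest. This is a deliberate simplification: production encoders run a motion search inside each reference and weigh distortion against the bits needed to signal the choice, and the function names and tiny 2x2 blocks here are made up for the example.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def pick_reference(target, candidates):
    """Return (index, cost) of the candidate block with the lowest SAD."""
    costs = [sad(target, candidate) for candidate in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]


target = [[10, 12], [14, 16]]
candidates = [
    [[0, 0], [0, 0]],      # block from reference 0, SAD 52
    [[10, 13], [14, 15]],  # block from reference 1, SAD 2
    [[20, 20], [20, 20]],  # block from reference 2, SAD 28
]
print(pick_reference(target, candidates))  # (1, 2)
```

Reference 1 wins because its block leaves the smallest residual to encode. Evaluating more candidates raises the odds of finding such a close match, at the cost of more search time, which is the trade slower presets make.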

