Published 2026-06-02 · 16 min read · By Nikolay Sapunov, CEO at Fora Soft

Why This Matters

In the previous lesson we drew the agent loop — perceive, reason, act, observe — and showed why video products lean on it. This lesson opens the engine and names the three parts that make the loop actually run. If you are scoping an agent feature for conferencing, surveillance, OTT, e-learning, or telemedicine, these are the parts your vendor or your engineers will argue about, budget for, and get wrong. You do not need to write the code, but you do need to know what a "tool," a "memory store," and a "planning pattern" are well enough to ask whether the design fits the job. By the end you will be able to look at any agent proposal and check that all three primitives are present, sized correctly, and not over-built.

The Three Powers A Plain Model Lacks

A language model on its own is a closed box. It reads the text you give it and writes text back. It cannot open a file, run a detector on a video, look up what it did yesterday, or decide to take five steps instead of one. To turn that closed box into an agent — a system that pursues a goal on its own — you bolt on three capabilities. Each one fills a specific gap.

The first power, called tool use, lets the model act in the world outside its own text. The second power, called memory, lets it keep and recall information beyond the single block of text in front of it. The third power, called planning, lets it break a goal into steps and decide their order. Take any one away and the agent breaks: no tools and it can only talk; no memory and it forgets the moment a task gets long; no planning and it cannot handle anything that takes more than one move. Everything else in agent engineering — frameworks, orchestration, evaluation — is built on top of these three.

[Figure: three labeled pillars (tool use, memory, and planning) standing under the agent loop, each pillar showing its one-line job and a video example, illustrating that all three are required to turn a model into an agent.]

Figure 1. The three primitives sit under the agent loop from Lesson 7.1: tool use reaches the world, memory carries knowledge, planning orders the steps. Remove any one and the agent stops working.

Tool Use: How An Agent Reaches Outside Itself

A tool is any function the agent can call to do something the model cannot do by writing text — run an object detector, transcribe audio, query a database, send a webhook, search a clip archive. The mechanism that makes this work has a name that sounds technical but describes something simple: function calling, also called tool calling.

Here is the whole idea in one pass. You describe each tool to the model the way you would describe a form to a new hire: a name (run_person_detector), a one-line description of what it does, and a list of the inputs it needs with their types — a camera ID (text), a start time and end time (timestamps), a confidence threshold (a number between 0 and 1). That description of inputs is written in a format called JSON Schema, which is just a precise way to spell out "this field is a number, that one is text, this one is required." When the model decides to use the tool, it does not run anything itself. It fills in the form — it produces a small block of structured text naming the tool and giving values for each input. Your code reads that block, runs the real function, and hands the result back to the model as the next observation. The model never touches the detector; it only fills in the form and reads what came back.
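The round trip described above can be sketched in a few lines of Python. This is an illustration only, not a vendor API: the schema layout, the `handle_tool_call` helper, the canned detector result, and the camera ID are all assumptions made for the example. Only the tool name `run_person_detector` and its four inputs come from the text.

```python
import json

# Hypothetical tool description in the JSON Schema style the text describes.
RUN_PERSON_DETECTOR = {
    "name": "run_person_detector",
    "description": "Find people in footage from one camera within a time window.",
    "parameters": {
        "type": "object",
        "properties": {
            "camera_id": {"type": "string"},
            "start_time": {"type": "string", "format": "date-time"},
            "end_time": {"type": "string", "format": "date-time"},
            "confidence_threshold": {"type": "number", "minimum": 0, "maximum": 1},
        },
        "required": ["camera_id", "start_time", "end_time", "confidence_threshold"],
    },
}


def run_person_detector(camera_id, start_time, end_time, confidence_threshold):
    """Stand-in for the real detector; returns a canned clip for illustration."""
    return [{"camera_id": camera_id, "clip_start": start_time, "score": 0.91}]


# Registry the runtime uses to map a tool name to real code.
TOOLS = {"run_person_detector": run_person_detector}


def handle_tool_call(raw_call: str) -> str:
    """Parse the model's structured call, run the function, return the observation."""
    call = json.loads(raw_call)
    func = TOOLS[call["name"]]
    result = func(**call["arguments"])
    return json.dumps({"tool": call["name"], "result": result})


# What the model might emit: it fills in the form and nothing more.
model_output = json.dumps({
    "name": "run_person_detector",
    "arguments": {
        "camera_id": "loading-bay-3",
        "start_time": "2026-06-01T22:00:00",
        "end_time": "2026-06-02T06:00:00",
        "confidence_threshold": 0.6,
    },
})

observation = handle_tool_call(model_output)
print(observation)
```

The model side only ever produces `model_output`; everything that touches the detector lives in the runtime's `handle_tool_call`.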

The reliability of this hinges on the model filling the form out correctly, and in 2026 that is largely a solved problem. Both OpenAI and Anthropic offer a structured outputs mode that constrains generation at the token level so the model cannot omit a required field or invent an invalid value — Anthropic reports schema validity above 99% with constrained decoding. That matters for a video pipeline, where a malformed timestamp does not just produce a bad sentence; it sends a detector to the wrong hour of footage and wastes a GPU-minute. The practical rule: every tool gets a strict schema, and your code still validates the result before acting on it.
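To make the "still validate before acting" rule concrete, here is a minimal hand-written check using only the Python standard library. The function name `validate_detector_args` and the specific rules (non-empty camera ID, threshold in range, start before end) are assumptions for illustration; a production system would more likely use a schema-validation library.

```python
from datetime import datetime

REQUIRED_FIELDS = ("camera_id", "start_time", "end_time", "confidence_threshold")


def validate_detector_args(args: dict) -> list[str]:
    """Return a list of problems; an empty list means the arguments look safe to run."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in args]
    if problems:
        return problems

    if not isinstance(args["camera_id"], str) or not args["camera_id"]:
        problems.append("camera_id must be a non-empty string")

    threshold = args["confidence_threshold"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        problems.append("confidence_threshold must be a number")
    elif not 0 <= threshold <= 1:
        problems.append("confidence_threshold must be between 0 and 1")

    try:
        start = datetime.fromisoformat(args["start_time"])
        end = datetime.fromisoformat(args["end_time"])
        if start >= end:
            problems.append("start_time must be before end_time")
    except (TypeError, ValueError):
        # Covers non-strings, unparseable text, and mixed naive/aware timestamps.
        problems.append("start_time and end_time must be comparable ISO 8601 timestamps")
    return problems


good_args = {
    "camera_id": "loading-bay-3",
    "start_time": "2026-06-01T22:00:00",
    "end_time": "2026-06-02T06:00:00",
    "confidence_threshold": 0.6,
}
print(validate_detector_args(good_args))
print(validate_detector_args(dict(good_args, confidence_threshold=1.5)))
```

A check like this catches the case the text warns about: a well-formed but wrong time window that would otherwise send the detector to the wrong hour of footage.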

For most of the technology's short history, wiring a tool to an agent meant writing a custom adapter for each one — one integration for your detector, another for your transcription service, another for your database. That changed with the Model Context Protocol (MCP), an open standard Anthropic introduced in November 2024 to make tools speak one common language. Instead of a bespoke adapter per tool, a tool is exposed once as an MCP "server," and any MCP-aware agent can discover it, read its schema, and call it. Adoption moved fast: OpenAI adopted MCP in March 2025, and by early 2026 there were on the order of ten thousand public MCP servers. In December 2025 Anthropic donated the protocol to a neutral foundation under the Linux Foundation, co-founded with Block and OpenAI, signaling that MCP is now shared industry plumbing rather than one vendor's idea. For a product team, the takeaway is concrete: ask whether a proposed agent uses MCP for its tools, because the answer tells you how much custom integration code you are signing up to maintain.

[Figure: a round-trip diagram of a tool call. The model emits a structured call naming a tool and its JSON arguments, the runtime executes the real function such as a detector or ASR service, and the observation returns to the model, with an MCP layer shown standardizing the tool connections.]

Figure 2. Anatomy of a tool call. The model fills in a form (tool name + JSON arguments); the runtime executes the real function; the result returns as the next observation. MCP standardizes how tools are wired in.

A common and expensive mistake here is tool overload: handing the agent forty tools because each one seemed useful. The more tools an agent can choose from, the more often it picks the wrong one or stalls deciding, and accuracy on tool selection drops as the menu grows. A video investigator agent needs a handful of sharp tools — a detector, a transcription call, a clip-retrieval query, a report writer — not a kitchen drawer. Start with the smallest tool set that covers the job and add a tool only when a real task proves it is missing.

Memory: Why The Model Forgets, And What To Do About It

A model reads everything inside one block of text called its context window — the span of words it can hold in view at once. Whatever is in the window is all the model knows in that moment; whatever falls out is gone. This is the model's working memory, and it is both fast and very small relative to a real video workload. That single fact drives every memory decision in agent design.

Do the arithmetic once and the problem is obvious. A model counts text in tokens, where one token is roughly three-quarters of a word, so a word is about 1.33 tokens. A one-hour meeting transcribed at a normal speaking rate of about 150 words per minute produces 150 × 60 = 9,000 words, which is 9,000 × 1.33 ≈ 12,000 tokens. A large 2026 model with a 200,000-token window can therefore hold roughly 200,000 ÷ 1.33 ≈ 150,000 words, or between sixteen and seventeen hours of transcript. That sounds like a lot until you point an agent at a week of recorded meetings: forty hours is about 40 × 12,000 = 480,000 tokens, roughly 2.4 times the window. Paste-it-all-in does not scale. The footage and the transcripts of a real video product are routinely larger than the window.
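The same arithmetic as a short script, so the numbers can be rerun with a different speaking rate or window size. The constants are the rough figures from the paragraph above, not measured values, and the helper name `transcript_tokens` is made up for this sketch.

```python
WORDS_PER_MINUTE = 150           # normal speaking rate used in the text
TOKENS_PER_WORD = 4 / 3          # one token is roughly three-quarters of a word
CONTEXT_WINDOW_TOKENS = 200_000  # window size assumed in the text


def transcript_tokens(hours: float) -> int:
    """Estimate the token count of a transcript of the given length."""
    words = hours * 60 * WORDS_PER_MINUTE
    return round(words * TOKENS_PER_WORD)


one_hour = transcript_tokens(1)
one_week = transcript_tokens(40)
hours_that_fit = CONTEXT_WINDOW_TOKENS / one_hour
overflow_ratio = one_week / CONTEXT_WINDOW_TOKENS

print(f"one hour of transcript: {one_hour:,} tokens")
print(f"hours that fit in the window: {hours_that_fit:.1f}")
print(f"a 40-hour week: {one_week:,} tokens, {overflow_ratio:.1f}x the window")
```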

The way out is to keep only the active task in the window and store everything else outside the model, fetching the relevant piece when it is needed. Researchers have organized agent memory into four kinds, borrowed from how human memory was mapped in cognitive science and formalized for language agents in the 2023 CoALA framework from Princeton. Here is each kind, described in plain terms before its label is given:

The memory of what is happening right now — the current goal, the last few observations, the running plan — is working memory. It lives in the context window and vanishes when the task ends. The memory of specific past events — "on Tuesday the loading-bay camera flagged a person at 23:14" — is episodic memory. It is written to an external store and recalled across sessions. The memory of general facts that do not belong to one event — the facility's camera map, the list of known staff faces, the catalog of objects the system can recognize — is semantic memory, usually kept in a searchable database. And the memory of how to do things — the standing investigation procedure, the rules the agent always follows — is procedural memory, which lives in the agent's system instructions or is baked in through training.

| Memory type | What it holds | Where it lives | Lifespan | Video example |
|---|---|---|---|---|
| Working | The current task, goal, last steps | Context window | This task only | The investigation in progress |
| Episodic | Specific past events | External store (often vector DB) | Across sessions | "Person flagged at camera 3, 23:14 Tuesday" |
| Semantic | General facts and knowledge | Searchable database | Long-lived | Camera map, known faces, object catalog |
| Procedural | How to perform the task | System prompt or model weights | Persistent | The standing investigation playbook |

The engineering pattern that connects these is retrieval. The long-term memories sit in an external store; when the agent needs them, a search pulls back only the handful of relevant items and drops them into the working window for that turn. This is the same retrieval machinery used for video archives, covered in the lesson on video RAG over an archive: memory and retrieval-augmented generation are the same idea pointed at different data. One influential 2023 system, MemGPT, framed the whole problem as an operating system would: the context window is fast "main memory," the external store is slower "disk," and the agent pages information between them as needed. You do not have to adopt that design to keep the analogy it teaches: a small fast memory plus a large slow one, with something deciding what to load.
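A toy version of the retrieval step, to show its shape. Real systems typically score relevance with embeddings in a vector database; this sketch swaps in simple keyword overlap so it runs with the standard library alone. The `MemoryStore` class and the stored events are invented for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class MemoryStore:
    """External long-term memory: everything is stored, only matches are loaded."""

    def __init__(self):
        self._items = []

    def write(self, text: str) -> None:
        self._items.append(text)

    def search(self, query: str, top_k: int = 2) -> list[str]:
        query_words = tokenize(query)
        scored = [(len(query_words & tokenize(item)), item) for item in self._items]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
        return [item for _, item in scored[:top_k]]


store = MemoryStore()
store.write("Tuesday 23:14: loading-bay camera flagged a person")
store.write("Monday 09:02: lobby camera flagged a delivery cart")
store.write("Last week: loading-bay camera produced false hits from a moving shadow")
store.write("Friday 18:40: parking camera offline for maintenance")

# Only the best matches are paged into the working window for this turn.
retrieved = store.search("false hits on the loading-bay camera", top_k=2)
working_context = ["Goal: check the loading bay after 22:00"] + retrieved
print("\n".join(working_context))
```

Four events sit in the store, but only two reach the window. That is the "small fast memory plus a large slow one" split in miniature.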

The pitfall to avoid is the mirror image of tool overload: context stuffing, the urge to cram the whole archive, every past transcript, and a long instruction manual into the window "so the agent has everything." It backfires twice. It costs money, because you pay per token on every single call, and it costs accuracy, because models attend less reliably to information buried in a very long context. Keep the window lean — the task and the few retrieved facts that bear on it — and let the external store hold the rest.
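One way to enforce a lean window is a hard token budget: the task always goes in, and retrieved facts are added in ranked order only while they fit. The sketch below assumes the rough four-thirds tokens-per-word estimate used earlier in this lesson; the function names and the budget of 40 tokens are illustrative, not recommendations.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4/3 tokens per whitespace-separated word."""
    return round(len(text.split()) * 4 / 3)


def pack_context(task: str, ranked_facts: list[str], budget_tokens: int) -> list[str]:
    """Keep the task, then add facts in ranked order until the next one would not fit."""
    kept = [task]
    used = estimate_tokens(task)
    for fact in ranked_facts:
        cost = estimate_tokens(fact)
        if used + cost > budget_tokens:
            break  # stop at the first fact that would exceed the budget
        kept.append(fact)
        used += cost
    return kept


task = "Did anyone enter the loading bay after 22:00 last night?"
facts = [
    "Camera 3 covers the loading bay.",
    "Camera 3 produced false hits last week.",
    " ".join(["word"] * 500),  # stands in for a long pasted transcript
]
context = pack_context(task, facts, budget_tokens=40)
print(len(context), "items kept in the window")
```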

Planning: Turning A Goal Into Ordered Steps

Planning is the power to take a goal stated in one sentence — "tell me if anyone entered the loading bay after 22:00 last night" — and produce the ordered sequence of actions that answers it. Three planning shapes have proven themselves in production, and they differ in when the thinking happens.

The first, ReAct, interleaves thinking and acting one step at a time: the agent reasons about the next move, takes it, observes the result, then reasons about the move after that. It was formalized in the 2022 ReAct paper and is the default for interactive work where the next step truly depends on what the last step found. A surveillance investigation fits this well — you cannot plan which clip to inspect closely until the detector tells you which clips exist.
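The reason, act, observe cycle can be sketched with a scripted stand-in where the model would be. Everything here is hypothetical: `scripted_reasoner` replaces the language model call, and the two tools return canned data. The loop structure is the point.

```python
def list_clips(camera_id: str) -> list[str]:
    """Stand-in for a clip-retrieval tool."""
    return ["clip-0412", "clip-0415"]


def inspect_clip(clip_id: str) -> str:
    """Stand-in for a closer look at one clip."""
    return {"clip-0412": "shadow only", "clip-0415": "person at 23:14"}[clip_id]


TOOLS = {"list_clips": list_clips, "inspect_clip": inspect_clip}


def scripted_reasoner(goal: str, history: list[tuple]) -> tuple[str, str]:
    """Stand-in for the model: pick the next action from what was observed so far."""
    if not history:
        return ("list_clips", "loading-bay-3")
    clips = history[0][2]  # observation from the first step
    inspected = {arg for name, arg, _ in history if name == "inspect_clip"}
    for clip in clips:
        if clip not in inspected:
            return ("inspect_clip", clip)
    findings = [obs for name, _, obs in history if name == "inspect_clip"]
    return ("finish", "; ".join(findings))


def react_loop(goal: str, max_steps: int = 10):
    history = []
    for _ in range(max_steps):
        action, arg = scripted_reasoner(goal, history)  # reason
        if action == "finish":
            return arg, history
        observation = TOOLS[action](arg)                # act
        history.append((action, arg, observation))      # observe
    raise RuntimeError("step limit reached without an answer")


answer, steps = react_loop("Did anyone enter the loading bay after 22:00?")
print(answer)
```

Note that the reasoner cannot choose which clip to inspect until the first observation arrives, which is the dependency that makes ReAct the right fit for this kind of task.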

The second, plan-and-execute, separates the thinking from the doing: a planner writes the full step list up front, then an executor runs the steps in order, replanning only if something breaks. Because it does not call the expensive reasoning model before every single action, it is cheaper and faster on tasks whose shape is known in advance. An overnight job that ingests, transcribes, tags, and indexes a batch of recordings is a plan-and-execute task: the steps are the same for every file, so decide them once.
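A minimal sketch of that split, assuming a planner that is called once per batch and an executor that needs no model at all. The step functions and file names are placeholders, `make_plan` stands in for the single planning call, and the replanning-on-failure branch is left out for brevity.

```python
def ingest(record: dict) -> dict:
    return {**record, "ingested": True}


def transcribe(record: dict) -> dict:
    return {**record, "transcript": f"transcript of {record['file']}"}


def tag(record: dict) -> dict:
    return {**record, "tags": ["meeting"]}


def index(record: dict) -> dict:
    return {**record, "indexed": True}


STEPS = {"ingest": ingest, "transcribe": transcribe, "tag": tag, "index": index}

planner_calls = 0  # counts how often the (expensive) planner is invoked


def make_plan(goal: str) -> list[str]:
    """Stand-in for one planning call to the model; returns the full step list."""
    global planner_calls
    planner_calls += 1
    return ["ingest", "transcribe", "tag", "index"]


def execute_plan(plan: list[str], file_name: str) -> dict:
    """Run the planned steps in order for one file, with no further planning."""
    record = {"file": file_name}
    for step_name in plan:
        record = STEPS[step_name](record)
    return record


def run_batch(goal: str, files: list[str]) -> list[dict]:
    plan = make_plan(goal)  # think once
    return [execute_plan(plan, name) for name in files]  # then just do


results = run_batch("Ingest and index last night's recordings", ["a.mp4", "b.mp4", "c.mp4"])
print(f"planner calls: {planner_calls}, files processed: {len(results)}")
```

Three files are processed with a single planner call, which is where the cost advantage over step-by-step reasoning comes from.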

The third, reflection (the pattern introduced by the 2023 Reflexion paper), adds a self-critique step: the agent produces a result, judges its own work against the goal, writes down what was wrong, and tries again with that critique in mind. It costs the most because it runs extra passes, so you reserve it for tasks where a wrong final answer is expensive and there is a clear way to check quality — a compliance report drawn from footage, where being wrong has consequences and the claims can be verified against the clips.
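The draft, critique, retry cycle looks roughly like this. Both `draft_report` and `critique_report` are scripted stand-ins for model calls, and the quality check (a report must cite a known clip ID) is an assumed example of "a clear way to check quality."

```python
from typing import Optional


def draft_report(clips: list[str], critique: Optional[str]) -> str:
    """Stand-in for the model's drafting call; improves once it has a critique."""
    if critique is None:
        return "A person entered the loading bay."  # first draft lacks evidence
    return "A person entered the loading bay at 23:14 (clip-0415)."


def critique_report(report: str, clips: list[str]) -> Optional[str]:
    """Return a written critique, or None when the report passes the check."""
    if not any(clip_id in report for clip_id in clips):
        return "Report cites no clip; add the clip ID and timestamp."
    return None


def reflect_and_retry(clips: list[str], max_attempts: int = 3):
    critique = None
    for attempt in range(1, max_attempts + 1):
        report = draft_report(clips, critique)     # produce a result
        critique = critique_report(report, clips)  # judge it against the goal
        if critique is None:
            return report, attempt
    raise RuntimeError("no acceptable report within the attempt limit")


report, attempts = reflect_and_retry(["clip-0415"])
print(f"accepted after {attempts} attempts: {report}")
```

Each retry is another full pass through the model, which is why the text reserves this pattern for outputs where a wrong answer is expensive.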

| Planning pattern | When the thinking happens | Best for | Relative cost | Video example |
|---|---|---|---|---|
| ReAct | Before every step, one at a time | Interactive tasks; next step depends on the last | Medium | Live archive investigation |
| Plan-and-execute | All up front, then execute | Known, repeatable multi-step jobs | Lower | Overnight batch ingest + tagging |
| Reflection | After a result, then retry | High-stakes output with a quality check | Higher | Compliance report from footage |

Underneath all three sits one shared skill: task decomposition, the breaking of a big goal into smaller pieces small enough to act on. "Investigate the loading bay" becomes "list cameras covering the bay," then "run the person detector on each between 22:00 and 06:00," then "inspect the candidate clips," then "write the finding." The planning pattern decides whether those pieces are produced all at once or one at a time, but the decomposition is what makes a vague goal executable at all. A useful test when reviewing an agent design: ask to see the decomposition for a real task. If no one can write down the steps, the agent will not be able to either.
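The review test above can be made mechanical: write the decomposition down as data and check that every step maps to a tool the agent actually has. The `Step` type, the tool names other than `run_person_detector`, and the `uncovered_steps` helper are assumptions for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    description: str
    tool: str  # name of the tool expected to carry out this step


# The loading-bay decomposition from the text, written out as data.
DECOMPOSITION = [
    Step("List cameras covering the bay", "list_cameras"),
    Step("Run the person detector on each between 22:00 and 06:00", "run_person_detector"),
    Step("Inspect the candidate clips", "inspect_clip"),
    Step("Write the finding", "write_report"),
]

AVAILABLE_TOOLS = {"list_cameras", "run_person_detector", "inspect_clip", "write_report"}


def uncovered_steps(steps: list[Step], tools: set[str]) -> list[str]:
    """Return the descriptions of steps that no available tool can perform."""
    return [step.description for step in steps if step.tool not in tools]


print(uncovered_steps(DECOMPOSITION, AVAILABLE_TOOLS))
print(uncovered_steps(DECOMPOSITION, AVAILABLE_TOOLS - {"write_report"}))
```

An empty list means every step is covered; a non-empty one names the gap before the agent discovers it at run time.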

[Figure: three side-by-side flows comparing the planning patterns. ReAct interleaves reason and act step by step, plan-and-execute produces a full step list then runs it, and reflection adds a self-critique and retry loop; each is tagged with the kind of video task it suits best.]

Figure 3. Three planning patterns. ReAct decides step by step; plan-and-execute decides everything up front; reflection critiques its own result and retries. The right one depends on whether the task is interactive, repeatable, or high-stakes.

How The Three Primitives Work Together

The primitives are not independent gadgets; they interlock on every turn of the agent loop. Walk one turn of a surveillance investigation and watch all three fire. The agent holds the goal and its progress in working memory. Its planning pattern — ReAct here — decides the next step is to run the person detector on the loading-bay camera. It executes that step through tool use, filling in the detector's JSON form with the camera ID and the time window. The detector returns three candidate clips, which land back in working memory as the new observation. Before inspecting them, the agent checks semantic memory for the facility's camera map to confirm which feed actually covers the bay, and episodic memory for whether this camera produced false hits last week. Then the loop turns again. Each primitive did its job, and none could have finished the task alone.

This is also why the order of these lessons matters. The agent-loop lesson gave you the loop. This lesson gives you the parts inside it. The next lesson, on agent frameworks, is about the tools engineers use to assemble these parts without building everything from scratch — LangGraph, CrewAI, and the rest exist precisely to manage tool calls, memory stores, and planning loops so a team does not hand-roll all three.

[Figure: a single turn of the agent loop annotated to show all three primitives firing: working memory holding the goal, planning choosing the next step, tool use calling the detector, and episodic plus semantic memory consulted before the loop repeats.]

Figure 4. One turn of a surveillance investigation. Planning picks the step, tool use executes it, working memory holds the state, and long-term memory is consulted before the loop turns again. All three primitives fire on every pass.

When To Keep It Simple

Each primitive earns its keep only when the task needs it, and the cheapest agent is the one that uses the fewest. If a job needs no information beyond what fits in one window, it needs no external memory — do not stand up a vector database for a task that summarizes a single ten-minute clip. If a job is one fixed sequence of steps, it needs no open-ended planner — write the steps as code, the way the previous lesson described a workflow. If a job calls one service once, it needs one tool, not a tool framework. The discipline that keeps agent projects from collapsing under their own cost is matching the primitive to the need: add memory when the task outgrows the window, add planning when the steps are not knowable in advance, and add tools one at a time as real tasks demand them.

Where Fora Soft Fits In

We build video products across conferencing, streaming, OTT, surveillance, e-learning, telemedicine, and AR/VR, and when a feature needs an agent, these three primitives are where the design work lives. In surveillance, that means giving an investigator agent a tight tool set and an episodic memory of past incidents so it does not re-flag the same shadow nightly. In conferencing, it means a meeting copilot whose semantic memory holds the team's context and whose working memory tracks the live call without trying to swallow a year of transcripts. In OTT and e-learning, it means async reviewers that plan a fixed ingest-and-tag sequence rather than reasoning from scratch on every file. Our job is to size each primitive to the task — and to leave out the ones the task does not need.

What To Read Next


  • Talk to a video AI engineer — scope which primitives your agent feature actually needs, and which it can skip: /services/llm-agent-development
  • See our case studies — surveillance, conferencing, OTT, and telemedicine work: /portfolio
  • Download the Agent Primitives field guide: tool use, the four memory types, and the three planning patterns on one page.

References

  1. Yao et al. — "ReAct: Synergizing Reasoning and Acting in Language Models," arXiv:2210.03629 (Princeton University & Google, 2022) — https://arxiv.org/abs/2210.03629 — tier 5 (primary algorithmic source). The canonical interleaved reason–act planning pattern; source of truth for ReAct.
  2. Shinn, Cassano, Berman, Gopinath, Narasimhan, Yao — "Reflexion: Language Agents with Verbal Reinforcement Learning," arXiv:2303.11366 (2023) — https://arxiv.org/abs/2303.11366 — tier 5 (primary). The reflection pattern: verbal self-critique stored and reused to improve on retry.
  3. Schick et al. — "Toolformer: Language Models Can Teach Themselves to Use Tools," arXiv:2302.04761 (Meta AI, 2023) — https://arxiv.org/abs/2302.04761 — tier 5 (primary). Foundational tool-use work; models learn when and how to call external tools (calculator, search, QA, translation, calendar).
  4. Sumers, Yao, Narasimhan, Griffiths — "Cognitive Architectures for Language Agents (CoALA)," arXiv:2309.02427 (Princeton University, 2023) — https://arxiv.org/abs/2309.02427 — tier 5 (primary). The four-part agent memory taxonomy: working, episodic, semantic, procedural. Source of truth for the memory-types table.
  5. Packer et al. — "MemGPT: Towards LLMs as Operating Systems," arXiv:2310.08560 (UC Berkeley, 2023) — https://arxiv.org/abs/2310.08560 — tier 5 (primary). OS-style virtual context management: fast in-context memory paged against a larger external store.
  6. Anthropic — "Building Effective Agents" (December 2024) — https://www.anthropic.com/research/building-effective-agents — tier 3. Workflow vs agent distinction; guidance to use the simplest design and add agentic complexity only when needed.
  7. Anthropic — "Introducing the Model Context Protocol" (November 2024) — https://www.anthropic.com/news/model-context-protocol — tier 3 (protocol author). MCP as an open standard for connecting agents to tools and data; the canonical description of the tool-standardization layer.
  8. Model Context Protocol — official specification and documentation — https://modelcontextprotocol.io — tier 3 (de-facto spec). Tool/server schema, discovery, and invocation model; 2026 governance under the Linux Foundation's Agentic AI Foundation.
  9. OpenAI — "Function calling" and "Structured Outputs" API documentation — https://platform.openai.com/docs/guides/function-calling — tier 4 (production deployer). JSON Schema function definitions; strict structured outputs that guarantee schema-valid tool arguments.
  10. Wikipedia — "Model Context Protocol" (orientation only) — https://en.wikipedia.org/wiki/Model_Context_Protocol — tier 6. Used only for the adoption timeline (Nov 2024 launch; OpenAI March 2025; foundation donation Dec 2025); each fact cross-checked against the primary announcements above.