Published 2026-05-23 · 16 min read · By Nikolay Sapunov, CEO at Fora Soft

Why this matters

If you ship a video product, the AI features your roadmap promises will be powered either by a model you downloaded or by a model you call over the network. That single decision shapes your cost curve, your privacy posture, your latency, your compliance burden, and your engineering headcount for the next three years. Get the format wrong and your iOS app ships with a 4 GB checkpoint that App Store review rejects; get the procurement wrong and you discover at 100,000 users a month that your OpenAI bill exceeds your AWS bill. This article is the cheat sheet that lets product, engineering, and finance teams agree on a procurement plan in one meeting instead of one quarter.

The two procurement decisions, in plain language

Every AI feature in your product starts with two questions. The first is closed-vs-open: are you renting a model from a vendor's API, or are you running the model yourself on hardware you control? The second is the format question: if you are running the model yourself, which file does the engineering team accept?

These questions are independent of each other in principle and tightly coupled in practice. A closed-API choice resolves both questions at once — the format is the vendor's HTTPS endpoint, and you do not care what is on the other side. An open-weights choice cracks the second question open: now you have a 13 gigabyte file on Hugging Face and you have to decide whether to load it with vLLM in your own datacenter, with llama.cpp on a developer's laptop, with Core ML on an iPhone, or with TensorRT on an NVIDIA GPU.

Think of it like building a house. The closed-vs-open decision is whether you rent or buy. The format decision, if you buy, is whether you take possession as raw lumber, as pre-fab walls, or as a finished apartment — each unlocks different tools, different speed, and different freedom to renovate.

This article walks both decisions end to end. We start with the five artifact formats — what each one is, who maintains it, what runtimes it feeds. Then we map each format to the deployment topology layer (on-device, edge, in-region cloud, cross-region cloud) from the previous lesson on latency and deployment topology. Then we turn to the open-vs-closed procurement question, with the cost arithmetic, compliance constraints, and the decision tree that resolves it.

The five model artifact formats that ship in 2026

Five formats cover roughly 95% of what you will see across video-AI projects. One is pure tensor storage: Safetensors, the successor to the now-deprecated pickle-based PyTorch .bin files. Two are inference-optimised packages that compile a model for a specific runtime (TensorRT engines, Core ML .mlpackage). One, ONNX, is the cross-platform intermediate. And one, GGUF, is the on-device, CPU-friendly format that powers most of the "run an LLM on your laptop" experience.

We walk them in this order: Safetensors (training canonical), GGUF (compressed local), ONNX (portable), TensorRT (NVIDIA-optimised), Core ML (Apple-optimised).

Safetensors — the canonical training checkpoint

Safetensors is a binary file format developed and open-sourced by Hugging Face in 2022. It stores raw tensor data — the millions or billions of numbers that are a trained neural network — together with a small JSON header that lists each tensor's name, shape, and data type. There is no executable code inside a Safetensors file, by design. The file cannot do anything when you open it. It can only hand you back the tensors it carries.
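To make the "data only" layout concrete, here is a minimal sketch of the on-disk structure: an 8-byte little-endian header length, a JSON header, then raw tensor bytes. It uses only the Python standard library. The function names are hypothetical, and the real safetensors library adds validation, alignment, and optional metadata that this sketch leaves out.

```python
import json
import struct

def write_safetensors_like(tensors):
    """Serialise {name: (dtype, shape, raw_bytes)} into the Safetensors-style layout."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [offset, offset + len(raw)],
        }
        payload += raw
        offset += len(raw)
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    # 8-byte little-endian header length, then the JSON header, then raw tensor bytes.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload

def read_header(blob):
    """Return (header_dict, data_start) without touching the tensor bytes."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    return header, 8 + header_len

def read_tensor_bytes(blob, name):
    """Slice one tensor's raw bytes out of the data section."""
    header, data_start = read_header(blob)
    begin, end = header[name]["data_offsets"]
    return blob[data_start + begin:data_start + end]

weights = struct.pack("<4f", 0.5, -1.0, 2.0, 3.5)   # four float32 values
blob = write_safetensors_like({"layer.weight": ("F32", [2, 2], weights)})
header, _ = read_header(blob)
print(header)
```

Nothing in the reader evaluates or imports anything: it parses JSON and slices bytes, which is the whole point of the format.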

That last property is the reason Safetensors exists. PyTorch's original .bin and .pt files use Python's pickle serialization, which has a fundamental security flaw: loading a pickle file is allowed to execute arbitrary Python code, baked into the file when it was saved. A malicious actor who can replace a checkpoint on a model registry can make every downstream user run code of their choosing, simply by virtue of the user calling torch.load(). This is not theoretical — pickle-based supply-chain attacks against Hugging Face checkpoints have been documented since 2023. Safetensors was designed to make the attack impossible: the file contains only data, never code.
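The pickle problem is easy to reproduce with a harmless stand-in. In the sketch below, the hypothetical Payload class plays the role of a tampered checkpoint object; the "attack" only evaluates an arithmetic expression, but the same mechanism can call any function the attacker names.

```python
import pickle

class Payload:
    """Stand-in for a tampered checkpoint object (hypothetical, harmless)."""
    def __reduce__(self):
        # pickle records "call eval with this string" and replays it at load time.
        return (eval, ("6 * 7",))

blob = pickle.dumps(Payload())   # what an attacker would upload
result = pickle.loads(blob)      # what pickle-based checkpoint loading does under the hood
print(type(result), result)
```

The loader never returns a Payload at all: it returns whatever the embedded call produced. Recent PyTorch releases narrow this risk with a weights-only loading mode, but a data-only format removes the mechanism entirely.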

Safetensors files end with the extension .safetensors. They are usually sharded across multiple files for large models — a 70-billion-parameter Llama checkpoint will arrive as a folder containing model-00001-of-00030.safetensors through model-00030-of-00030.safetensors, plus an index.json that tells the loader which tensor lives in which file. The folder also contains a config.json describing the model architecture, a tokenizer.json for text models, and a README.md (the "model card") with license, training data, and benchmark information.

By 2026, Safetensors has displaced pickle as the de facto checkpoint format. Hugging Face publishes new model releases in Safetensors first; PyTorch has merged native Safetensors support into its core serialisation API, and the framework's documentation now recommends it as the default save format. If a 2026 model release ships only as .bin files, treat it as a yellow flag — either the provider is years behind, or the checkpoint is from an older training run that has not been re-saved.

What Safetensors is not: it is not a compressed format, and it is not a deployment format. A 70-billion-parameter Llama checkpoint in Safetensors is roughly 140 GB in float-16 precision (70 billion weights × 2 bytes each). You can train against a Safetensors file, you can fine-tune from one, and you can serve from one, provided you have the GPU memory to hold all 140 GB at once. For production deployment on smaller hardware, you will convert it into one of the other formats below.

GGUF — the compressed local-inference format

GGUF, short for "GGML Universal File", is a binary file format introduced in August 2023 by the maintainers of llama.cpp. It is the format that put large language models on your laptop. A GGUF file packages tensors, model metadata (architecture, context length, tokenizer rules), and end-application configuration into a single, memory-mappable binary that llama.cpp and its friends can load in milliseconds.

The defining feature of GGUF is its quantisation support. Quantisation, in plain language, is the art of storing each number in a model with fewer bits than it was trained with: replacing 16-bit floating-point weights with 8-bit integers, then with 4-bit, and increasingly with 2-bit or even 1.58-bit values. Each step down in precision shrinks the model on disk, lowers the memory it consumes when running, and lets it run on humbler hardware, at the cost of some accuracy. GGUF standardises the metadata describing which quantisation scheme was used per tensor, so a loader can read the file and dequantise each tensor to a close approximation of the original at inference time.
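A minimal sketch of the idea, assuming a simple block-wise 8-bit scheme in the spirit of llama.cpp's Q8_0 (a block of weights shares one scale factor). The function names and toy weights are illustrative; the K-quant schemes in real GGUF files are more elaborate.

```python
def quantize_block(values):
    """Map a block of floats to int8-range codes plus one shared scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 127 if peak else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the codes and the block scale."""
    return [scale * c for c in codes]

def quantize(weights, block_size=32):
    return [quantize_block(weights[i:i + block_size])
            for i in range(0, len(weights), block_size)]

def dequantize(blocks):
    out = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out

weights = [((i * 37) % 101 - 50) / 50 for i in range(64)]   # toy weights in [-1, 1]
blocks = quantize(weights)
restored = dequantize(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max round-trip error: {max_error:.5f}")
```

The round trip is lossy by construction: each weight lands on the nearest of 255 levels within its block, so the error is bounded by half a step of that block's scale.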

The quantisation labels you will see on every GGUF download (Q4_K_M, Q5_K_M, Q8_0, F16, BF16) are not arbitrary. They name specific algorithms. The naming convention works like this: the digit after Q is the nominal number of bits per weight (2, 3, 4, 5, 6, 8), with the true average running slightly higher because each block also stores scale factors; K means "K-quants" (the modern variant); M means "medium" (a balanced size/quality variant, with S for small and L for large also available); F16 and BF16 are unquantised half-precision baselines. Recent additions like Q4_K_XL and the IQ family (imatrix-based "I-quants") squeeze a little more quality from each bit.

The trade-off table below is the one decision every team makes when shipping a local-inference feature. It is built from independent 2026 benchmarks comparing each quantisation against the unquantised baseline on the same model. Read the numbers as ballpark: actual values move slightly per model and per benchmark, but the relative ordering is stable.

| Quantisation | Bits per weight (avg) | Size vs F16 | Quality retained | Typical home |
|---|---|---|---|---|
| F16 / BF16 | 16 | 100% | 100% (baseline) | Server with abundant VRAM |
| Q8_0 | 8 | 50% | ~99% | High-end laptop, single workstation GPU |
| Q6_K | 6 | 38% | ~98% | Mid-range workstation |
| Q5_K_M | 5 | 31% | ~97% | Apple Silicon MacBook (16–24 GB) |
| Q4_K_M | 4.5 | 28% | ~95% | Most consumer laptops (8–16 GB) |
| Q3_K_M | 3.5 | 22% | ~88% | Memory-constrained edge devices |
| Q2_K / IQ2 | 2.5 | 16% | ~75% | Last-resort fits |

Worked example. A Llama 3 70B model in F16 takes about 140 GB on disk and about 145 GB of VRAM to run — out of reach for almost every laptop. The same model at Q4_K_M takes 140 × 0.28 ≈ 39 GB, runs comfortably on a 64 GB Mac Studio, and retains roughly 95% of the original's task accuracy. Step further to Q2_K and the file shrinks to 22 GB, but quality drops below the threshold where most production users will accept it.
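The sizing arithmetic in this example generalises to any model and quantisation. The sketch below assumes file size is simply parameters times bits per weight, uses the bits-per-weight column from the table, and treats the 80% memory headroom as a rule-of-thumb assumption rather than a measured figure.

```python
# Average bits per weight, taken from the trade-off table.
BITS = {"F16": 16, "Q8_0": 8, "Q6_K": 6, "Q5_K_M": 5,
        "Q4_K_M": 4.5, "Q3_K_M": 3.5, "Q2_K": 2.5}

def checkpoint_size_gb(params_billion, bits_per_weight):
    """Approximate size: parameters x bits per weight, in decimal gigabytes."""
    return params_billion * bits_per_weight / 8

def fits(params_billion, quant, memory_gb, headroom=0.8):
    """True if the weights fit in `headroom` x memory (headroom is an assumption)."""
    return checkpoint_size_gb(params_billion, BITS[quant]) <= memory_gb * headroom

for quant, bits in BITS.items():
    print(f"{quant:7s} {checkpoint_size_gb(70, bits):6.1f} GB")
```

Calling fits(70, "Q4_K_M", 64) asks the question from the worked example directly: does a 70B model at Q4_K_M fit on a 64 GB machine with room left for the context cache and the operating system?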

GGUF is read by llama.cpp first, and then by every project built on top of it: Ollama (a wrapper that makes llama.cpp easy to run as a service), LM Studio (a desktop UI for trying local models), GPT4All, Jan, and koboldcpp. As of 2026, Hugging Face hosts tens of thousands of GGUF checkpoints with a built-in metadata viewer and inference endpoint service. The Hugging Face GGUF library page lists models by quantisation, so a team picking a release can filter directly to Q4_K_M 7B variants when shopping for a feature.

What GGUF is not: it is not used for training, and it is not the format you ship to most production servers. GGUF is optimised for CPU and unified-memory inference (Apple Silicon, AMD APUs, recent Intel chips), and its GPU support, while functional via the CUDA backend of llama.cpp, lags vLLM and TensorRT on raw throughput. If you serve thousands of concurrent requests on NVIDIA hardware, you will not pick GGUF — you will pick vLLM with Safetensors, or TensorRT with a compiled engine.

ONNX — the cross-platform exchange format

ONNX, short for "Open Neural Network Exchange", is an open-source format originally co-developed by Microsoft and Facebook in 2017 and now governed by the Linux Foundation. It exists to solve a specific problem: a model trained in PyTorch should be loadable in TensorFlow, in Apple's runtime, in NVIDIA's runtime, and in a browser, without rewriting the model in each framework. ONNX defines a stable on-disk graph format — a directed acyclic graph of mathematical operations — that every major framework knows how to export to and import from.

The file extension is .onnx. A single file holds the model graph (every operation, every weight, every connection between operations) together with the input and output tensor shapes. Tooling exists in every framework: PyTorch's torch.onnx.export() is the most common path, and Hugging Face's optimum library has high-level helpers for transformer models. The ONNX model zoo on GitHub hosts canonical exports of computer vision models (ResNet, YOLO families, MobileNet) and many speech and language models.

ONNX is most useful when paired with ONNX Runtime, Microsoft's cross-platform inference engine. ONNX Runtime accepts an .onnx file and runs it through an "execution provider" chosen at load time: CUDA on NVIDIA GPUs, TensorRT for further optimisation on NVIDIA, DirectML on Windows GPUs, CoreML on Apple, OpenVINO on Intel, QNN on Qualcomm. The same file runs everywhere; the application passes a priority-ordered list of providers, and the runtime assigns work to the highest-priority one the host supports, falling back to CPU. On an Azure Cobalt 100 (Arm64) server with the SqueezeNet-INT8 model, ONNX Runtime sustains over 538 inferences per second with peak memory under 37 MB, typical of the "lean and portable" envelope ONNX targets.
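Provider selection can be sketched without the runtime installed. The provider names below are ONNX Runtime's own identifiers; the preference order and the pick_providers helper are assumptions for illustration, to be tuned per product.

```python
# Preference order is an assumption for illustration; tune it per product.
PREFERRED = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CoreMLExecutionProvider",
    "OpenVINOExecutionProvider",
    "DmlExecutionProvider",
    "QNNExecutionProvider",
    "CPUExecutionProvider",
]

def pick_providers(available, preferred=PREFERRED):
    """Keep the preferred providers that exist on this host, in priority order."""
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")   # always keep a fallback
    return chosen

# With onnxruntime installed, `available` would come from
# onnxruntime.get_available_providers(), and the result would be passed as
# onnxruntime.InferenceSession("model.onnx", providers=chosen).
laptop = pick_providers(["CPUExecutionProvider", "CoreMLExecutionProvider"])
gpu_box = pick_providers(["CUDAExecutionProvider", "CPUExecutionProvider"])
print(laptop, gpu_box)
```

Keeping the list explicit in your own code, rather than relying on defaults, makes it obvious in logs which backend a given host actually used.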

What ONNX is good for in a video product: exporting a YOLO detector once, then deploying the same artifact to a browser via ONNX Runtime Web, to an Android phone via ONNX Runtime Mobile, to a Linux edge box via ONNX Runtime + OpenVINO, and to a cloud GPU via ONNX Runtime + CUDA. Ultralytics YOLO supports ONNX export out of the box precisely because of this multi-platform property.

What ONNX is not: it is not as fast on any specific hardware as that hardware's native format. A YOLO model in ONNX Runtime on an NVIDIA GPU will be 1.5× to 3× slower than the same model converted to a TensorRT engine. ONNX is the right answer when portability matters more than raw throughput; TensorRT is the right answer when throughput matters more than portability.

A second caveat: not every model architecture exports cleanly. Newer transformer variants, custom attention kernels, dynamic control flow — these can fail or degrade during ONNX export. Always export, then run a quality check (PSNR, accuracy, or task-specific metric) to confirm the exported model behaves like its source. The 2022 study of deep learning model conversion challenges found that errors during PyTorch-to-ONNX export are common enough to require a routine post-conversion validation step.
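A post-conversion check can be as small as the sketch below: run the same input through the source model and the exported model, then compare the outputs. The tolerance, the helper names, and the sample outputs are all placeholders; pick a threshold that matches your task metric.

```python
import math

def max_abs_diff(reference, candidate):
    """Largest element-wise deviation between two output vectors."""
    return max(abs(r - c) for r, c in zip(reference, candidate))

def psnr(reference, candidate, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)

def export_is_faithful(reference, candidate, atol=1e-3):
    """Hypothetical acceptance rule: every output within `atol` of the source model."""
    return (len(reference) == len(candidate)
            and max_abs_diff(reference, candidate) <= atol)

pytorch_out = [0.10, 0.70, 0.20]        # stand-in for the source model's output
onnx_out = [0.1002, 0.6997, 0.2001]     # stand-in for the exported model's output
print(export_is_faithful(pytorch_out, onnx_out), round(psnr(pytorch_out, onnx_out), 1))
```

In practice you would run this over a held-out batch rather than one sample, and fail the build when the check does not pass.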

TensorRT — the NVIDIA-compiled engine

TensorRT is NVIDIA's inference SDK and the file format it produces — a compiled "engine" optimised for a specific GPU architecture (Ampere, Hopper, Blackwell) and a specific model. You feed TensorRT an ONNX file (or a model defined in PyTorch via torch_tensorrt) and it produces a .engine or .plan file. That file is faster than any general-purpose runtime can be — TensorRT does layer fusion, kernel auto-tuning, precision calibration (FP16, INT8, FP8, FP4 on newer GPUs), and memory layout optimisations that target the exact silicon you compiled against.

Throughput improvements from TensorRT compilation depend on the model and the workload, but are routinely 2× to 5× over PyTorch eager-mode execution on the same GPU, and 1.5× to 3× over ONNX Runtime with the CUDA execution provider. For LLM serving specifically, NVIDIA's TensorRT-LLM library (a TensorRT specialisation for transformer language models) achieves near-state-of-the-art throughput, often within 10–20% of vLLM and ahead on certain quantised workloads.

What TensorRT is good for in a video product: production serving of computer vision models at high throughput on NVIDIA hardware. A surveillance system processing thousands of camera streams in parallel on an H100 cluster will run YOLO and SAM 2 through TensorRT engines, not through PyTorch. Real-time video VLM inference, where every millisecond of GPU time costs money, lives here.

What TensorRT is not: it is not portable across GPU generations. An engine compiled for an A100 will not run on a B200. You re-compile when you change hardware, which adds operational complexity. It is not a format you publish — it is a format you produce locally for the GPU you happen to own. And it is not the right answer if your workload runs on anything other than NVIDIA: TensorRT is NVIDIA-only.

Core ML — the Apple device package

Core ML is Apple's on-device machine learning runtime and the file format it consumes — .mlpackage (the modern, extensible format introduced in Xcode 13) or the older .mlmodel. A .mlpackage is a folder, not a single file, containing the model's graph (in Apple's ML Program format), the weights, and a JSON manifest. The package is consumed by Core ML at runtime on iPhone, iPad, Mac, Apple Watch, and Vision Pro.

Conversion is via the coremltools Python package, maintained by Apple. The typical path is PyTorch → Core ML directly, using coremltools.convert() on a traced or exported model. The older PyTorch → ONNX → Core ML route is legacy: recent coremltools releases no longer ship the ONNX converter. The output .mlpackage ships inside the app bundle in App Store distribution, or is downloaded on first launch from the developer's CDN.

What Core ML is good for in a video product: on-device features in iOS and macOS apps where latency matters and the data must not leave the device. Background blur, beauty filters, gaze correction, on-device segmentation, small ASR models for live captioning, on-device object detection in surveillance apps. Core ML picks the best available hardware on the device (the Neural Engine on iPhones with A12 and later, the GPU otherwise, the CPU as a fallback), and the model runs without any network traffic. The Vision framework on iOS pairs naturally with Core ML for image and video pipelines.

What Core ML is not: it is not the format you ship to anything other than Apple platforms. It is also not a great choice for very large LLMs; Apple's MLX framework (a separate but related project) is the better path for transformer language models on Apple Silicon, with native Metal kernels and unified-memory awareness. Core ML excels at convolutional and lightweight transformer models; for serving Llama 70B on a Mac, MLX and llama.cpp are the contenders.

Figure 1. The five formats and their conversion paths, arranged by purpose and platform. Safetensors is the canonical training checkpoint; everything else is a deployment derivative.

How the formats fit the deployment topology

The previous lesson on latency and deployment topology introduced four layers where a model can physically live: on-device, on-edge, in-region cloud, cross-region cloud. The five formats map onto those layers like this.

On a user's device — phone, laptop, browser, IP camera SoC — the format is platform-specific. iOS and macOS apps use Core ML. Android apps use LiteRT (formerly TensorFlow Lite) or ONNX Runtime Mobile. Browsers use ONNX Runtime Web with the WebGPU execution provider, or transformers.js, or a WASM-compiled llama.cpp for the LLM case. Linux edge devices with a discrete GPU use TensorRT (if NVIDIA), OpenVINO (if Intel), or ONNX Runtime with the matching provider. Developer laptops and prosumer Macs running an LLM locally use GGUF via llama.cpp or Ollama.

At the edge (a CDN node, a 5G mobile edge compute site, a regional GPU pool) the choice narrows. Cloudflare Workers AI accepts ONNX models for the most part; CPU-only serverless functions such as AWS Lambda typically run ONNX Runtime, while serverless GPU offerings prefer ONNX or compiled TensorRT engines. A LiveKit Agents worker living on an edge GPU typically runs a quantised model in vLLM (Safetensors) or a GGUF model via Ollama, depending on the throughput target.

In an in-region cloud — your own AWS, GCP, Azure, or specialised GPU cloud (Modal, Replicate, Together, Fireworks, RunPod) — Safetensors is the source of truth and the format your serving stack consumes. vLLM, the de facto open-weights LLM server, loads Safetensors directly. NVIDIA Triton Inference Server accepts ONNX, TensorRT engines, PyTorch, TensorFlow, and Python backends. For maximum throughput on NVIDIA hardware at scale, the chain is Safetensors → TensorRT-LLM compilation → engine deployment. For maximum portability, the chain is Safetensors → ONNX export → Triton or ONNX Runtime serving.

In cross-region cloud — almost always a closed API. If you are calling Anthropic, OpenAI, or Google from a region where they do not host the model you need, you are paying a cross-region latency penalty for the privilege of using a frontier model. The format here is whatever the vendor exposes over HTTPS — usually OpenAI-compatible JSON.

The table below summarises the matrix. Read it as "given a layer and a job, here is the format you should expect to use".

| Layer | LLM | Computer vision | Speech / audio |
|---|---|---|---|
| On-device (iOS) | Core ML or MLX | Core ML | Core ML or Whisper.cpp (GGUF) |
| On-device (Android) | ONNX Runtime Mobile, LiteRT | LiteRT or ONNX | LiteRT, ONNX, or Whisper-WASM |
| On-device (browser) | llama.cpp WASM (GGUF), transformers.js (ONNX) | ONNX Runtime Web (WebGPU) | Whisper WASM, transformers.js |
| Edge (CDN, MEC, regional) | vLLM (Safetensors) or Ollama (GGUF) | ONNX Runtime + TensorRT EP | ONNX or TensorRT |
| In-region cloud (NVIDIA) | vLLM (Safetensors) or TensorRT-LLM | TensorRT engine | TensorRT or ONNX |
| In-region cloud (CPU) | Ollama (GGUF) | ONNX Runtime | ONNX or Whisper.cpp |
| Cross-region cloud | Closed API (HTTPS) | Closed API | Closed API |
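If you want the matrix in a form a build script or scoping tool can query, it fits in a small lookup. The layer and job keys below are hypothetical identifiers; the values are copied from the table.

```python
JOBS = ("llm", "vision", "speech")

# Layer key -> (LLM, computer vision, speech/audio), copied from the table.
MATRIX = {
    "on-device-ios": ("Core ML or MLX", "Core ML", "Core ML or Whisper.cpp (GGUF)"),
    "on-device-android": ("ONNX Runtime Mobile, LiteRT", "LiteRT or ONNX",
                          "LiteRT, ONNX, or Whisper-WASM"),
    "on-device-browser": ("llama.cpp WASM (GGUF), transformers.js (ONNX)",
                          "ONNX Runtime Web (WebGPU)", "Whisper WASM, transformers.js"),
    "edge": ("vLLM (Safetensors) or Ollama (GGUF)", "ONNX Runtime + TensorRT EP",
             "ONNX or TensorRT"),
    "in-region-nvidia": ("vLLM (Safetensors) or TensorRT-LLM", "TensorRT engine",
                         "TensorRT or ONNX"),
    "in-region-cpu": ("Ollama (GGUF)", "ONNX Runtime", "ONNX or Whisper.cpp"),
    "cross-region": ("Closed API (HTTPS)", "Closed API", "Closed API"),
}

def expected_format(layer, job):
    """Return the format to expect for a deployment layer and a job type."""
    return MATRIX[layer][JOBS.index(job)]

print(expected_format("in-region-nvidia", "vision"))
```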

Common procurement mistakes — and the pitfalls each format hides

Before we move to the open-vs-closed decision, a few common mistakes are worth flagging, because they recur on every project that does not have someone on the team who has shipped this before.

Mistake one: shipping a Safetensors checkpoint inside a mobile app. This is the most common rookie error, and it gets caught at App Store or Play Store review. A 13 GB checkpoint will not pass mobile app size limits, and even if it did, the device cannot load it into memory. The fix is to convert: PyTorch or Safetensors → Core ML via coremltools (for iOS), or → LiteRT (for Android). The mobile-deliverable model is always a converted, quantised derivative, never the training checkpoint.

Mistake two: using GGUF on a high-throughput NVIDIA server. GGUF can run on CUDA via llama.cpp, but the throughput envelope is wrong: under high concurrency, vLLM with Safetensors or TensorRT-LLM with a compiled engine will outperform llama.cpp by 3× to 4× on the same hardware. Pick GGUF for laptop developers and Apple Silicon Macs; pick vLLM or TensorRT for server farms.

Mistake three: trusting a pickle-based checkpoint from an unfamiliar publisher. If a model release ships only .bin or .pt files and the publisher is not a well-known institution, do not load it without scanning. Tools like picklescan exist precisely for this case. Better: refuse the artifact and ask the publisher for a Safetensors export. By 2026 there is no good reason for a new release to be pickle-only.
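What a scanner looks for can be sketched with the standard library alone: walk the pickle opcode stream without executing it and flag the opcodes that import or call objects. This is an illustration of the idea, not picklescan's actual implementation; the Payload class is a harmless stand-in for attacker code.

```python
import pickle
import pickletools

# Opcodes that import or call objects; plain numbers, lists, and dicts need none of them.
SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX", "REDUCE"}

def suspicious_opcodes(blob):
    """List (opcode_name, byte_position) pairs that can trigger imports or calls."""
    return [(op.name, pos)
            for op, _arg, pos in pickletools.genops(blob)
            if op.name in SUSPICIOUS]

class Payload:
    def __reduce__(self):
        return (eval, ("6 * 7",))   # harmless stand-in for attacker code

plain = pickle.dumps({"layer.weight": [0.5, -1.0, 2.0]})
tampered = pickle.dumps(Payload())

print(suspicious_opcodes(plain))
print(suspicious_opcodes(tampered))
```

Real PyTorch checkpoints legitimately use a few of these opcodes to rebuild tensors, so a production scanner pairs this walk with an allowlist of known-safe imports rather than rejecting every hit.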

Mistake four: assuming "open weights" means "MIT licensed". For frontier models it often does not. Llama 4 is under the Llama 4 Community License: companies with more than 700 million monthly active users need a separate licence from Meta, attribution is required on derivatives, and the multimodal variants are explicitly not licensed for individuals or companies headquartered in the European Union. Mistral 3, by contrast, is genuinely Apache 2.0. Qwen and DeepSeek licence terms have varied by release and model size, from custom licences to Apache 2.0 and MIT, so check the specific checkpoint. MiniMax recently shifted to a "Modified-MIT" that restricts commercial deployment without written authorisation. Procurement must read every weight licence before sign-off; there is no general rule.

Mistake five: comparing closed-API and open-weights cost on token price alone. Closed APIs charge per million tokens. Open weights charge per GPU-hour, plus engineering headcount to operate the stack, plus the load factor of your concurrency profile. The economics depend on volume — closed wins under low volume, open wins by a wide margin at scale. We compute the breakeven below.

The open-vs-closed procurement decision

We can now turn to the second question. You have read the format map. Whether you ever touch a format file at all depends on a single procurement decision: do you rent the intelligence (closed API), or do you own the artifact and run it yourself (open weights)?

There is no right answer in the abstract. There is only a right answer for a given use case at a given volume. The framework below is the one we use on Fora Soft projects when a client asks us to scope an AI feature.

Step 1 — Eliminate by privacy and compliance

The first filter is binary. If the data the feature processes cannot leave your network — under HIPAA in the United States, under GDPR for EU citizens, under PCI for payment data, under SOC 2 Type II commitments, under EU AI Act Article 50 transparency obligations, or under any sector-specific rule — the closed API is disqualified for the data path. You can still use a closed API to test ideas in development, but production traffic must pass through hardware you control. This is open-weights territory by elimination, regardless of cost.

The EU AI Act fully applies from August 2, 2026 for most operators. Annex XII obliges providers of general-purpose AI models to deliver technical documentation to downstream integrators — a requirement that is harder to satisfy when the model is a closed-API black box, and easier when you control the artifact.

Telemedicine, regulated finance, defence, and many enterprise deployments fall into this bucket by default. Most consumer video features do not.

Step 2 — Eliminate by capability gap

The second filter looks at whether the task you need actually has an open-weights model good enough to do it. As of mid-2026, the capability landscape is roughly this:

For frontier reasoning — multi-step planning, advanced math, hardest agentic workflows — closed models (Claude Opus 4.6, GPT-5, Gemini 3.1 Deep Think) still hold a measurable lead on benchmarks like GPQA Diamond and Humanity's Last Exam. Open-weights frontier models (Llama 4 Maverick, DeepSeek V4, Qwen 3) are close but not equal.

For most production workloads — content moderation, summarisation, transcription, translation, computer vision tasks, video understanding at common scales — open-weights models match or exceed closed models on task-specific benchmarks. A surveillance pipeline running YOLO and SAM 2 has no reason to call a closed API.

For task-specific small models — speech recognition (Whisper), embedding generation, object detection (YOLO, RT-DETR), segmentation (SAM 2, Florence-2) — open weights have been the standard for years. The closed competitors lag.

If your feature lives in the frontier reasoning bucket and quality drives your business, closed wins. If it lives in any other bucket, open is at least competitive.

Step 3 — Compute the cost crossover

Once compliance and capability are settled, run the cost arithmetic. Closed API charges per million input and output tokens; open weights charge per GPU-hour plus engineering operating cost.

Worked example. You ship an AI meeting summary feature. Each meeting produces 8,000 input tokens and 800 output tokens. You expect 100,000 summaries per month at launch, growing to 1,000,000 per month within a year.

Closed API math (using GPT-4-equivalent pricing of $0.40 per million tokens — the 2026 floor for the capability tier):

Cost per summary = (8,000 / 1,000,000 × $0.40) + (800 / 1,000,000 × $0.40) = $0.0032 + $0.00032 ≈ $0.0035.

At 100,000 summaries/month: $350/month. At 1,000,000 summaries/month: $3,500/month.

Open weights math (vLLM serving a 70B-class model on a single H100 SXM at $2.50/hour, with realistic batching achieving 200 summaries/hour per GPU at the same quality):

Hours per month at 100,000 summaries = 100,000 / 200 = 500 hours. At $2.50/hour = $1,250/month — plus operations, monitoring, autoscaling, and engineering oncall. At a single-GPU-equivalent steady state, the closed API wins.

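At 1,000,000 summaries/month the load is 5,000 GPU-hours, or about 7 GPUs continuously busy (5,000 / 720 ≈ 6.9). At the $2.50 list rate that is $12,500/month; with better batching and reasonable utilisation, this lands around $7,000–$9,000/month of GPU spend, plus ~$5,000/month of engineering and infrastructure, for roughly $14,000/month all-in. By the arithmetic above, the closed API at $0.40 per million tokens costs only $3,500/month at this volume, so at floor pricing it still wins. Open weights begin to win once the blended closed-API price rises above roughly $1.60 per million tokens ($14,000 / 8,800 million tokens): at $4.00 per million tokens, for example, the closed API would cost $35,000/month against roughly $14,000/month for open weights, a 2.5× delta.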

The crossover point depends on three variables: the cost per token of the closed API, the cost per GPU-hour of your open-weights deployment, and how efficiently you batch. For most 2026 video-AI workloads, the closed API is cheaper below 5 to 10 million tokens per day; open weights wins above 50 to 100 million tokens per day; the middle is a judgment call that depends on engineering capacity.
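The three variables fit in a few lines of code, which makes it easy to re-run the comparison with your own numbers. The sketch below uses the inputs from the worked example (8,000 input and 800 output tokens per summary, $0.40 per million tokens, $2.50 per GPU-hour, 200 summaries per GPU-hour); the function names are illustrative.

```python
def closed_api_cost(summaries, input_tokens, output_tokens, usd_per_million_tokens):
    """Monthly closed-API bill at a single blended token price."""
    tokens = summaries * (input_tokens + output_tokens)
    return tokens / 1_000_000 * usd_per_million_tokens

def open_weights_cost(summaries, summaries_per_gpu_hour, usd_per_gpu_hour,
                      fixed_monthly_usd=0.0):
    """Monthly GPU spend plus any fixed engineering and infrastructure cost."""
    gpu_hours = summaries / summaries_per_gpu_hour
    return gpu_hours * usd_per_gpu_hour + fixed_monthly_usd

def break_even_price(summaries, input_tokens, output_tokens, open_total_usd):
    """Closed-API price per million tokens at which both options cost the same."""
    tokens_millions = summaries * (input_tokens + output_tokens) / 1_000_000
    return open_total_usd / tokens_millions

for volume in (100_000, 1_000_000):
    closed = closed_api_cost(volume, 8_000, 800, 0.40)
    gpu = open_weights_cost(volume, 200, 2.50)
    print(f"{volume:>9,} summaries: closed ${closed:,.0f}, open GPU-only ${gpu:,.0f}")
```

break_even_price is the useful one in a budget meeting: give it your all-in open-weights estimate and it returns the closed-API price above which self-hosting pays off.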

LLM inference cost per token has dropped roughly 10× year-on-year since 2023, and the crossover keeps moving — what was an open-weights win in 2024 may be a closed-API win in 2026 for the same volume. Re-run the math annually.

Step 4 — Apply the engineering tax

Open-weights deployments cost engineering time that closed APIs do not. A realistic baseline for running a single open-weights LLM service in production in 2026 includes one to two ML engineers maintaining the serving stack (vLLM, Triton, autoscaling, monitoring), one DevOps engineer managing GPU capacity and incidents, and shared time from a security engineer for supply-chain scanning and an SRE for oncall. Call it $30,000 to $60,000 per month of fully loaded engineering cost in a US or Western European market.

For a closed API, the engineering cost is the time of one application engineer who calls the HTTPS endpoint, plus prompt-engineering time. Call it $10,000 to $20,000 per month.

The delta — roughly $20,000 to $40,000 per month — is the cost of optionality. You are paying it to own the artifact, to keep data on your hardware, to swap models without renegotiating contracts, and to control quality and latency on your timeline. For a feature that drives material revenue, this is cheap. For an experimental feature, it is expensive.

Step 5 — Pick the hybrid

The right answer for most teams is not a side. It is a router. You route the easy 80% of traffic to an open-weights model you host, you route the hard 20% to a closed API, you measure, and you adjust quarterly. You start with closed for speed-to-launch, you migrate specific high-volume or privacy-sensitive workloads to open as the data justifies it, and you keep the closed fallback for anything the open model misses.
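A router of this kind can start very small. The sketch below is a hypothetical illustration: the Request fields, the difficulty score (assumed to come from an upstream classifier), and the 0.8 threshold are all assumptions to be replaced with whatever your product measures.

```python
from dataclasses import dataclass

@dataclass
class Request:
    prompt: str
    contains_private_data: bool = False
    difficulty: float = 0.0   # 0..1, from a hypothetical upstream classifier

def route(request, hard_threshold=0.8):
    """Return 'open' (self-hosted model) or 'closed' (vendor API)."""
    if request.contains_private_data:
        return "open"       # private data never leaves hardware you control
    if request.difficulty >= hard_threshold:
        return "closed"     # the hard tail goes to the frontier model
    return "open"

def closed_share(requests, hard_threshold=0.8):
    """Fraction of traffic sent to the closed API, for the quarterly review."""
    routed = [route(r, hard_threshold) for r in requests]
    return routed.count("closed") / len(routed)

traffic = [
    Request("summarise this call", difficulty=0.1),
    Request("plan a multi-step migration", difficulty=0.9),
    Request("summarise patient visit", contains_private_data=True, difficulty=0.95),
    Request("translate captions", difficulty=0.5),
]
print(closed_share(traffic))
```

The privacy check comes first on purpose: a hard request that carries regulated data still stays on the self-hosted path.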

This is the pattern Fora Soft uses on virtually every production AI integration we ship in 2026, and it is the pattern most agencies and engineering teams have converged on.

Figure 2. The procurement decision tree, branching on data residency, capability gap, expected volume, and engineering capacity. Each diamond is a filter that eliminates options before you spend engineering time.
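The same filters can be written as a short function, which is handy for sanity-checking a scoping spreadsheet. This is a sketch under stated assumptions: the 10 million and 50 million tokens-per-day cut-offs are picked from the ranges in Step 3, and the parameter names are illustrative.

```python
def procurement_plan(data_must_stay_on_prem, needs_frontier_reasoning,
                     tokens_per_day, has_ml_ops_team):
    """Apply the five-step framework in order; thresholds are assumed cut-offs."""
    if data_must_stay_on_prem:
        return "open weights"     # Step 1: compliance eliminates the closed API
    if needs_frontier_reasoning:
        return "closed API"       # Step 2: capability gap
    if tokens_per_day < 10_000_000:
        return "closed API"       # Step 3: below the crossover band
    if tokens_per_day > 50_000_000 and has_ml_ops_team:
        return "open weights"     # Steps 3-4: volume justifies the engineering tax
    return "hybrid router"        # Step 5: the middle ground

print(procurement_plan(False, False, 30_000_000, True))
```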

Where Fora Soft fits in

Fora Soft has shipped AI integrations into video products since 2019 — across video conferencing, OTT streaming, surveillance, telemedicine, and e-learning. We have run every artifact format described above in production: Safetensors on vLLM clusters for meeting copilots, GGUF on Mac Studios for on-prem deployments, ONNX in browsers for client-side moderation, TensorRT engines for surveillance analytics, Core ML packages for on-device beauty filters and live captions. The procurement framework in this article is the one we use when a client asks "should we build this on OpenAI or on our own GPU?". The right answer is almost always "both, routed by use case" — and the format map above is how that routing actually gets implemented.

A procurement checklist for your own numbers

To make the cost arithmetic concrete for your own product, we are publishing a one-page checklist that walks through the five-step procurement framework with the inputs you need to gather. It is the same artifact our engineers print before scoping calls. Download the model artifact + procurement audit checklist (PDF).

References

  1. GGUF File Format Specification, ggml-org/llama.cpp, accessed 2026-05-23. https://github.com/ggml-org/llama.cpp/blob/master/ggml/include/gguf.h — primary source for GGUF binary layout and metadata schema.
  2. GGUF documentation, Hugging Face Hub, accessed 2026-05-23. https://huggingface.co/docs/hub/en/gguf — distribution layer, tooling integration, model browser.
  3. Safetensors specification and security model, Hugging Face, accessed 2026-05-23. https://github.com/huggingface/safetensors — primary source for the canonical training checkpoint format.
  4. ONNX specification and runtime documentation, ONNX Foundation / Microsoft, accessed 2026-05-23. https://onnx.ai/ and https://onnxruntime.ai/ — primary source for cross-platform inference.
  5. TensorRT documentation, NVIDIA Developer, accessed 2026-05-23. https://docs.nvidia.com/deeplearning/tensorrt/ — primary source for compiled engine format and optimisation passes.
  6. Core ML and coremltools documentation, Apple Developer, accessed 2026-05-23. https://developer.apple.com/documentation/coreml — primary source for .mlpackage format and Apple device runtime.
  7. Llama 4 Community License Agreement, Meta, accessed 2026-05-23. https://www.llama.com/llama4/license/ — controlling licence for Llama 4 derivatives, including the 700 M MAU and EU multimodal clauses.
  8. EU Artificial Intelligence Act — Article 50, Annex XII, European Commission, accessed 2026-05-23. https://artificialintelligenceact.eu/article/50/ — transparency obligations applying from August 2, 2026.
  9. Open Source vs Closed LLMs: The 2026 Decision Framework, Let's Data Science, May 2026. https://letsdatascience.com/blog/open-source-vs-closed-llms-choosing-the-right-model-in-2026 — capability gap analysis and cost crossover modelling.
  10. Practical GGUF Quantization Guide, Enclave AI, November 2025. https://enclaveai.app/blog/2025/11/12/practical-quantization-guide-iphone-mac-gguf/ — independent quantisation quality benchmarks on Apple Silicon.
  11. Q4_K_M vs Q5_K_M vs Q8 — Which GGUF Quantization Should You Use?, Will It Run AI, 2026. https://willitrunai.com/blog/quantization-guide-gguf-explained — quantisation labels and trade-offs.
  12. vLLM project documentation and releases, vllm-project/vllm, accessed 2026-05-23. https://github.com/vllm-project/vllm — open-weights serving engine and PagedAttention reference.
  13. Best Open-Source LLM Models in 2026, Hugging Face blog, May 2026. https://huggingface.co/blog/daya-shankar/open-source-llms — open-weights landscape and licence shifts.
  14. PyTorch native Safetensors support, PyTorch documentation, 2026. https://pytorch.org/docs/stable/notes/serialization.html — deprecation of pickle as the default save format.