Why this matters

If you are scoping or running a surveillance system, "ONVIF vs RTSP" sounds like a choice you have to make. It is not: the two do different jobs that run together, and getting that wrong is the root of a surprising number of failures. A stream that plays perfectly on the test bench can stutter or vanish the moment it crosses a firewall, a router, or a VPN, and the reason is almost always a transport decision nobody made on purpose. Understanding the four protocols involved (ONVIF, RTSP, RTP, and RTCP), what each one does, what it cannot do, and where it breaks, lets you read a camera datasheet without guessing, ask a vendor the right questions, and predict the bandwidth and latency a design will actually deliver. You do not need to write a line of networking code to follow it; what you gain is the mental model that separates a system that survives a bad network day from one that only demos well. This is the layer beneath the ingest overview: the wire itself, in detail.

ONVIF vs RTSP vs RTP: who does what

Start by clearing up the question almost every buyer asks: "should I add this camera over ONVIF or over RTSP?" The honest answer is that the two are not competitors — they are different stages of the same conversation, and a typical system uses both at once. Confusing them is like asking whether you should reach a colleague "by phone or by talking" — one is the connection, the other is what travels over it.

Four named technologies move a camera's video into the recording software, and each has exactly one job. The software that ingests and records many camera streams is called a Video Management System (VMS); everything below describes how a camera's video reaches it.

ONVIF is the concierge. ONVIF — the Open Network Video Interface Forum standard — lets cameras and software from different makers understand each other. Its job here is to find the camera on the network and hand over the address of its video stream. ONVIF does not carry a single frame of video. When you "add a camera as ONVIF," the VMS uses ONVIF to discover the device, authenticate, and ask, "what is the address of your stream?" — and the camera answers with an RTSP address. (The full ONVIF story is in ONVIF explained for engineers; for the commercial overview, see Fora Soft's guide to ONVIF profiles in security systems.)

RTSP is the remote control. Once the VMS has that stream address, it connects with the Real-Time Streaming Protocol and presses buttons: describe yourself, set up the stream, play, stop. RTSP carries the commands, not the pictures. When you "add a camera as RTSP," you skip the ONVIF concierge and type the stream address by hand — which is why RTSP-only setups work but lose ONVIF's discovery, events, and configuration.

RTP is the moving van. The Real-Time Transport Protocol is what actually carries the compressed video across the network, one packet at a time. Everything you see on screen arrived as RTP.

RTCP is the delivery clipboard. The RTP Control Protocol rides alongside RTP, carrying no video — just short reports, in both directions, on how delivery is going: how many packets arrived, how many were lost, how jittery the timing was.

| Protocol | Plain-language role | Carries video? | Standard |
| --- | --- | --- | --- |
| ONVIF | Finds the camera, hands over the stream address, configures it | No | ONVIF profiles (S, T) |
| RTSP | The remote control: describe, set up, play, stop | No | IETF RFC 2326 / 7826 |
| RTP | The moving van: carries the compressed video packets | Yes | IETF RFC 3550 |
| RTCP | The clipboard: reports delivery quality both ways | No | IETF RFC 3550 |

Table 1. The division of labor. "ONVIF vs RTSP" is a false choice — ONVIF sets up the call, RTSP controls it, and RTP carries it. Only RTP moves video.

Figure 1. The four protocols as a relay, not a contest. ONVIF discovers the camera and hands over the stream address; RTSP starts and stops the session; RTP carries the actual compressed video; RTCP reports on delivery quality. Only RTP moves pictures.

RTSP: the remote control

RTSP is best understood through the standard's own words: "RTSP acts as a 'network remote control' for multimedia servers" (IETF RFC 2326, §1.1). It does not move video; it presses buttons on the camera, and the camera responds. It listens on a well-known door, TCP port 554 (RFC 2326, §3.2).
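As a small illustration of the default port, the sketch below splits a stream address into host, port, and path, falling back to 554 when the address names no port. The function name `rtsp_endpoint` and the example addresses are assumptions made for this example, not part of any camera or VMS API.

```python
from urllib.parse import urlsplit

RTSP_DEFAULT_PORT = 554  # default port for the rtsp scheme (RFC 2326, section 3.2)


def rtsp_endpoint(url: str) -> tuple:
    """Return (host, port, path) for an rtsp:// address, defaulting the port to 554."""
    parts = urlsplit(url)
    if parts.scheme != "rtsp" or parts.hostname is None:
        raise ValueError(f"not a usable RTSP address: {url!r}")
    return parts.hostname, parts.port or RTSP_DEFAULT_PORT, parts.path or "/"


# Hypothetical addresses: one relies on the default port, one overrides it.
print(rtsp_endpoint("rtsp://192.0.2.10/stream1"))
print(rtsp_endpoint("rtsp://cam.example:8554/live"))
```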

The buttons are a short, named set. Four are required of any RTSP system — OPTIONS, SETUP, PLAY, and TEARDOWN — and two more are in near-universal use, DESCRIBE and PAUSE (RFC 2326, §10). Walk the sequence the way a VMS does it. OPTIONS asks which commands the camera supports. DESCRIBE asks the camera to describe what it offers, and the camera answers with a small text document — a Session Description Protocol (SDP) block — that lists the available streams and, crucially, the codec each one uses. SETUP negotiates how the video will travel (the transport decision, below). PLAY starts the flow. TEARDOWN closes the session cleanly.
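The sequence above can be sketched as plain text, since RTSP 1.0 requests look much like HTTP requests. This is a minimal illustration that only formats the messages and sends nothing; the camera address, the track suffix, the client ports, and the session id are all hypothetical placeholders (a real session id comes back in the camera's SETUP response).

```python
from typing import Optional


def rtsp_request(method: str, url: str, cseq: int,
                 headers: Optional[dict] = None) -> str:
    """Format one RTSP/1.0 request: request line, CSeq, extra headers, blank line."""
    lines = [f"{method} {url} RTSP/1.0", f"CSeq: {cseq}"]
    lines += [f"{name}: {value}" for name, value in (headers or {}).items()]
    return "\r\n".join(lines) + "\r\n\r\n"


URL = "rtsp://192.0.2.10/stream1"   # hypothetical camera address
SESSION = {"Session": "12345678"}   # hypothetical id, normally taken from the SETUP reply

handshake = [
    rtsp_request("OPTIONS", URL, 1),
    rtsp_request("DESCRIBE", URL, 2, {"Accept": "application/sdp"}),
    rtsp_request("SETUP", URL + "/trackID=1", 3,
                 {"Transport": "RTP/AVP;unicast;client_port=50000-50001"}),
    rtsp_request("PLAY", URL, 4, SESSION),
    rtsp_request("TEARDOWN", URL, 5, SESSION),
]

for message in handshake:
    print(message, end="")
```

Each request carries an increasing CSeq number so that responses can be matched to the command that caused them.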

One property of RTSP trips up engineers who expect it to behave like a web request: RTSP is stateful. A normal web page is stateless — each request stands alone. RTSP is the opposite: "An RTSP server needs to maintain state by default in almost all cases, as opposed to the stateless nature of HTTP" (RFC 2326, §1.1). The camera remembers, between commands, that it set up a session and is now playing it; it tracks each session through states the standard names INIT, READY, and PLAYING. That memory is why a half-closed RTSP session can leave a camera holding resources, and why a well-built VMS always issues a TEARDOWN rather than just dropping the connection.
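The statefulness can be pictured as a small lookup table. The sketch below is a deliberately simplified subset of the state machine in RFC 2326, Appendix A (it omits RECORD and repeated SETUP or PLAY requests), and the names `TRANSITIONS` and `next_state` are invented for this example.

```python
# Simplified subset of the RTSP session states; see RFC 2326, Appendix A for the full tables.
TRANSITIONS = {
    ("INIT", "SETUP"): "READY",
    ("READY", "PLAY"): "PLAYING",
    ("READY", "TEARDOWN"): "INIT",
    ("PLAYING", "PAUSE"): "READY",
    ("PLAYING", "TEARDOWN"): "INIT",
}


def next_state(state: str, method: str) -> str:
    """Return the state a session moves to, or raise if the method is out of order."""
    try:
        return TRANSITIONS[(state, method)]
    except KeyError:
        raise ValueError(f"{method} is not valid in state {state}") from None


state = "INIT"
for method in ("SETUP", "PLAY", "TEARDOWN"):
    state = next_state(state, method)
    print(method, "->", state)
```

Only TEARDOWN returns the session to INIT, which is the point of the paragraph above: a connection that simply drops leaves the camera's copy of the session in READY or PLAYING until it times out.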

A modern note on versions: RTSP 2.0 was published as IETF RFC 7826 in December 2016 and formally obsoletes the 1998 RFC 2326. In practice, the overwhelming majority of cameras in the field still speak RTSP 1.0, so a VMS that ingests real-world hardware is built against RFC 2326 first and treats 2.0 as the exception. The protocol's internals — the full grammar, the proxy behavior, the interaction with general streaming infrastructure — belong to the streaming discipline; our Video Streaming section covers the transport theory in depth, while this article keeps the surveillance framing.

RTP: what is actually inside the moving van

When PLAY starts the flow, the video crosses the network as a stream of RTP packets. RTP is defined in IETF RFC 3550, an Internet Standard, and the useful thing to understand is what each packet carries besides the video itself, because those few extra bytes are what make live video survive an imperfect network.

Every RTP packet begins with a fixed 12-byte header (RFC 3550, §5.1). Most of it is bookkeeping, but three fields do the heavy lifting, and each maps directly to a real surveillance problem.

The sequence number is a 16-bit counter that "increments by one for each RTP data packet sent, and may be used by the receiver to detect packet loss and to restore packet sequence" (RFC 3550, §5.1). This is how the VMS knows a packet went missing (a gap in the count) and how it reassembles packets that arrive out of order — both routine events on a busy network.
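Because the counter is only 16 bits wide, it wraps from 65535 back to 0, so a receiver has to compare sequence numbers on a circle rather than a line. The sketch below shows one way to count missing packets under that constraint. The function names are invented for this example, and it assumes the first packet seen is the earliest one in the batch.

```python
def seq_delta(prev_seq: int, seq: int) -> int:
    """Signed distance from prev_seq to seq on the 16-bit circle (-32768..32767)."""
    return ((seq - prev_seq + 32768) % 65536) - 32768


def count_lost(seqs: list) -> int:
    """Packets still missing after reordering, given sequence numbers in arrival order."""
    first = highest = seqs[0]      # 'highest' is kept unwrapped (it may exceed 65535)
    received = {first}
    for seq in seqs[1:]:
        unwrapped = highest + seq_delta(highest % 65536, seq)
        received.add(unwrapped)    # late arrivals fill earlier gaps, duplicates are ignored
        highest = max(highest, unwrapped)
    expected = highest - first + 1
    return expected - len(received)


# Wraparound, one reordered packet (0 arrives after 1), and one real loss (2 never arrives).
print(count_lost([65534, 65535, 1, 0, 3]))
```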

The timestamp is a 32-bit field that "reflects the sampling instant of the first octet in the RTP data packet" (RFC 3550, §5.1). It is the clock that keeps motion smooth and keeps audio aligned with video. Without it, frames would play at the speed they happened to arrive, not the speed they were captured.
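To make the timestamp concrete: the H.264 and H.265 payload formats (RFC 6184 and RFC 7798) specify a 90 kHz RTP clock, so the difference between two timestamps converts directly into capture time. The helper below is a minimal sketch with an invented name, and it assumes the two timestamps are less than one 32-bit wrap apart.

```python
VIDEO_CLOCK_HZ = 90_000  # RTP clock rate specified for H.264 and H.265 video


def seconds_between(ts_a: int, ts_b: int, clock_hz: int = VIDEO_CLOCK_HZ) -> float:
    """Capture-time gap from ts_a to ts_b, allowing for one 32-bit wraparound."""
    return ((ts_b - ts_a) % 2**32) / clock_hz


# Two consecutive frames of a 25 fps stream are 3,600 ticks apart.
print(seconds_between(0, 3600))
```

A player schedules each frame by this difference rather than by arrival time, which is what keeps motion smooth when packets bunch up on the network.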

The SSRC (synchronization source) is a 32-bit random identifier for the stream's source. On a camera sending video and audio, each gets its own SSRC, so the VMS can tell the two apart even when they share a connection.

A fourth field, the payload type, is a 7-bit number that identifies the format inside, so the receiver knows whether it is holding H.265 or H.264. For these codecs the number is assigned dynamically, and the SDP returned by DESCRIBE says which codec it stands for.
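The fields described above sit at fixed offsets, so the 12-byte header can be unpacked in one call. The sketch below is illustrative only: it ignores the optional CSRC list and header extension, the `RtpHeader` and `parse_rtp_header` names are invented here, and the sample packet is fabricated for the demonstration.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Unpack the fixed 12-byte RTP header (RFC 3550, section 5.1)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: 2 single bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


# Fabricated sample: version 2, marker set, dynamic payload type 96.
sample = struct.pack("!BBHII", 0x80, 0x80 | 96, 4711, 90_000, 0x1234ABCD) + b"payload"
print(parse_rtp_header(sample))
```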

Figure 2. Inside one RTP packet. The 12-byte header is small but decisive: the sequence number detects loss and restores order, the timestamp keeps video smooth and synced, and the SSRC identifies the source. The compressed video rides behind it.

How one frame becomes dozens of packets

Here is the part the vendor explainers skip, and it explains a class of real bugs. A single video frame — especially a keyframe, the full picture the codec sends periodically — is far too big to fit in one network packet. A standard Ethernet network moves packets of about 1,500 bytes at a time (the Maximum Transmission Unit, or MTU); after the IP and UDP headers, roughly 1,400 bytes are left for the actual payload. A compressed keyframe from a 4-megapixel camera can be tens of kilobytes. The video has to be cut up.

The rules for cutting it up live in the RTP payload format for each codec: RFC 7798 for H.265/HEVC (March 2016) and RFC 6184 for H.264 (May 2011). Both define the same three structures, named differently. A single NAL unit packet carries one small, self-contained piece of video on its own. An aggregation packet bundles several small pieces into one packet, so the network is not wasted on tiny payloads. A fragmentation unit does the opposite: it splits one large piece across many packets, which is how a big keyframe travels.
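For H.264, a receiver can tell the three structures apart from the low five bits of the first payload byte, using the type ranges that RFC 6184 assigns. The sketch below covers H.264 only (H.265 uses a two-byte NAL header with different type numbers), and the function name is invented for this example.

```python
def h264_payload_structure(payload: bytes) -> str:
    """Classify an H.264 RTP payload by the NAL unit type in its first byte (RFC 6184)."""
    if not payload:
        raise ValueError("empty RTP payload")
    nal_type = payload[0] & 0x1F          # low 5 bits of the first payload byte
    if 1 <= nal_type <= 23:
        return "single NAL unit"
    if 24 <= nal_type <= 27:
        return "aggregation packet"       # STAP-A, STAP-B, MTAP16, MTAP24
    if nal_type in (28, 29):
        return "fragmentation unit"       # FU-A, FU-B
    return "reserved"


print(h264_payload_structure(bytes([0x65])))  # type 5, a keyframe slice small enough to fit
print(h264_payload_structure(bytes([0x7C])))  # type 28, one fragment of a larger NAL unit
```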

Make it concrete. Suppose a keyframe compresses to about 80 KB. At roughly 1,400 bytes of payload per packet, that single frame needs 80,000 ÷ 1,400 ≈ 57.1, which rounds up to 58 RTP packets, each tagged with the next sequence number, all carrying the same timestamp because they belong to the same frame. The VMS collects all 58, checks the sequence numbers for gaps, reassembles them into the whole keyframe, and only then can it be decoded. If even one of those 58 packets is lost on a UDP transport, the entire keyframe can be unusable, which is the technical reason a brief network hiccup shows up as a few seconds of smeared or frozen video, not a single dropped pixel.
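The same arithmetic in a few lines, with one extra step that shows why fragmentation makes keyframes fragile. The survival estimate assumes each packet is lost independently at a fixed rate, which is a simplification (real loss tends to come in bursts), and both function names are invented for this sketch.

```python
import math


def packets_for_frame(frame_bytes: int, payload_bytes: int = 1400) -> int:
    """RTP packets needed to carry one frame, ignoring per-fragment header overhead."""
    return math.ceil(frame_bytes / payload_bytes)


def frame_survival(packets: int, loss_rate: float) -> float:
    """Chance that every packet of a frame arrives, assuming independent losses."""
    return (1.0 - loss_rate) ** packets


n = packets_for_frame(80_000)
print(n, "packets for an 80 KB keyframe")
print(f"{frame_survival(n, 0.01):.1%} of such keyframes arrive intact at 1% packet loss")
```

Under that assumption, a 1% packet loss rate leaves only a little over half of 58-packet keyframes complete, which is why small loss percentages look so much worse on screen than the number suggests.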

RTCP: the quality back-channel everyone forgets

Running quietly alongside the RTP video is its control companion, RTCP, also defined in RFC 3550. It carries no video. Its job is feedback: periodically, the sender and receiver exchange short reports. The camera sends a Sender Report (RTCP packet type 200) describing what it has transmitted, with a timestamp that lets the VMS align audio with video; the VMS sends back a Receiver Report (type 201) describing what it received: the fraction of packets lost, the cumulative loss, and the jitter (the variation in packet arrival timing). RFC 3550 defines five RTCP packet types in all: Sender Report (200), Receiver Report (201), Source Description (202), BYE (203), and APP (204).

This back-channel is how a surveillance system knows a stream is degrading before an operator notices a frozen tile. Loss and jitter numbers from RTCP are the raw material for the health dashboards a serious VMS shows — "camera 47 is dropping 4% of packets" is an RTCP fact.
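A dashboard figure like that comes from the report block inside a Receiver Report. The sketch below unpacks one 24-byte report block as laid out in RFC 3550, section 6.4.1. The function name and the sample values are invented for this example, and the jitter conversion assumes the 90 kHz video clock.

```python
import struct


def parse_report_block(block: bytes, clock_hz: int = 90_000) -> dict:
    """Decode one RTCP report block: loss fraction, cumulative loss, and jitter."""
    if len(block) < 24:
        raise ValueError("RTCP report block is 24 bytes")
    ssrc, loss_word, highest_seq, jitter, _lsr, _dlsr = struct.unpack("!IIIIII", block[:24])
    fraction = loss_word >> 24            # top 8 bits: loss since last report, in 1/256ths
    cumulative = loss_word & 0xFFFFFF     # low 24 bits: signed cumulative packets lost
    if cumulative & 0x800000:
        cumulative -= 1 << 24
    return {
        "ssrc": ssrc,
        "fraction_lost": fraction / 256,
        "cumulative_lost": cumulative,
        "highest_seq": highest_seq,
        "jitter_ms": jitter * 1000 / clock_hz,  # jitter is reported in RTP clock ticks
    }


# Fabricated block: 10/256 of packets lost since the last report, 120 lost in total.
block = struct.pack("!IIIIII", 0xCAFE, (10 << 24) | 120, 70_000, 450, 0, 0)
report = parse_report_block(block)
print(f"loss {report['fraction_lost']:.1%}, jitter {report['jitter_ms']} ms")
```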

RTCP is deliberately frugal so its reports never crowd out the video. The standard is explicit: "It is RECOMMENDED that the fraction of the session bandwidth added for RTCP be fixed at 5%" (RFC 3550, §6.2), split so that senders use about a quarter of that and receivers three-quarters. To avoid a storm of reports on a large system, the standard also sets a floor on how often any one party reports: "The RECOMMENDED value for a fixed minimum interval is 5 seconds" (RFC 3550, §6.2). For a surveillance fleet this matters: RTCP overhead stays near 5% of the media bandwidth no matter how many cameras you add, so it is a rounding error in capacity planning — but it is the rounding error that tells you the system is healthy.

The decision that defines the stream: UDP, TCP, or HTTP

Everything above is fixed by the standards. The one genuine choice you make is at SETUP: how RTP travels. It is the single most consequential networking decision in a surveillance design, because it decides whether the stream crosses the network at all.

The first option is RTP over UDP, the historical default. UDP is fire-and-forget: it never waits for acknowledgements and never resends a lost packet, which gives the lowest latency. On a clean, managed local network that is exactly right. But UDP RTP uses a pair of dynamically negotiated ports, and a firewall or any Network Address Translation (NAT) — the address-sharing between a private network and the internet — has no fixed rule to let them through. Across the internet, UDP RTP often simply does not arrive.

The second option is RTP interleaved over the RTSP/TCP connection. Here the video is packed into the same TCP connection already carrying the RTSP commands on port 554, so there is only one connection for a firewall to allow, and TCP guarantees every byte arrives in order. RFC 2326 defines the framing exactly: stream data "is encapsulated by an ASCII dollar sign (24 hexadecimal), followed by a one-byte channel identifier, followed by the length of the encapsulated binary data as a binary, two-byte integer" (RFC 2326, §10.12). The cost is latency: when TCP loses a packet it pauses everything to retransmit it — head-of-line blocking — which can turn a brief glitch into a short freeze.
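The framing quoted above is small enough to write out in full. The sketch below wraps and unwraps one interleaved frame, with the two-byte length in network byte order as the RFC specifies. The function names are invented here, and a real client would also have to handle RTSP text responses arriving on the same connection between frames.

```python
import struct


def frame_interleaved(channel: int, data: bytes) -> bytes:
    """Wrap one RTP or RTCP packet: '$', channel byte, 16-bit big-endian length, data."""
    return b"$" + struct.pack("!BH", channel, len(data)) + data


def read_interleaved(buffer: bytes) -> tuple:
    """Split one frame off the front of a buffer; return (channel, data, rest)."""
    if len(buffer) < 4 or buffer[0:1] != b"$":
        raise ValueError("buffer does not start with an interleaved frame")
    channel, length = struct.unpack("!BH", buffer[1:4])
    if len(buffer) < 4 + length:
        raise ValueError("incomplete frame, wait for more bytes")
    return channel, buffer[4:4 + length], buffer[4 + length:]


frame = frame_interleaved(0, b"abc")
print(frame)
print(read_interleaved(frame + b"$\x01"))
```

By convention the channel numbers come from the `interleaved=` parameter agreed at SETUP, typically an even channel for RTP and the next odd one for RTCP.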

The third option, common in surveillance and skipped by most explainers, is RTP/RTSP tunneled over HTTP(S). The whole RTSP/RTP exchange is wrapped inside ordinary web traffic on port 80 or 443. ONVIF's own streaming specification lists this among its transport options precisely because it traverses the strict corporate proxies and firewalls that block everything else — at the price of the most overhead. ONVIF, in fact, builds all of this on the same IETF protocols: its streaming spec describes "a set of media streaming … options, all based on RTP," with "media control … accomplished over RTSP," and enumerates RTP/UDP, RTP/RTSP/TCP, and RTP/RTSP/HTTP/TCP as transports (ONVIF Streaming Specification).

| Transport | How it travels | Latency | Firewall / NAT | Best for |
| --- | --- | --- | --- | --- |
| RTP over UDP | Separate dynamic ports | Lowest | Often blocked | Clean, managed LAN |
| RTP interleaved over TCP | Inside the RTSP connection (554) | Higher (retransmits) | Passes, one connection | Firewalled / NAT links |
| RTP/RTSP over HTTP(S) | Wrapped in web traffic (80/443) | Highest | Passes strict proxies | Internet, corporate proxies |

Table 2. The transport choice at SETUP. Move down the table as the network gets more hostile: UDP on a clean LAN, TCP interleaved across a firewall, HTTP tunneling through a strict proxy. Many VMS platforms try UDP first and fall back automatically.
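The fallback order in Table 2 can be sketched as a loop over transports. This is an illustration, not any VMS's actual logic: `try_setup` is a hypothetical callback standing in for a real SETUP attempt, the client port pair is a placeholder, and the assumption that the HTTP tunnel reuses the interleaved Transport value is marked in the code.

```python
from typing import Callable, Optional

# Ordered from lowest latency to most firewall-friendly, as in Table 2.
TRANSPORTS = [
    ("udp", "RTP/AVP;unicast;client_port=50000-50001"),  # hypothetical port pair
    ("tcp", "RTP/AVP/TCP;unicast;interleaved=0-1"),
    ("http", "RTP/AVP/TCP;unicast;interleaved=0-1"),     # assumption: same framing inside the tunnel
]


def pick_transport(try_setup: Callable[[str, str], bool]) -> Optional[str]:
    """Return the first transport whose SETUP attempt succeeds, or None if all fail."""
    for name, transport_header in TRANSPORTS:
        if try_setup(name, transport_header):
            return name
    return None


# Simulated site where a firewall drops UDP but allows the RTSP/TCP connection.
def behind_firewall(name: str, transport_header: str) -> bool:
    return name != "udp"


print(pick_transport(behind_firewall))
```

A production client would also need a timeout on the UDP attempt, because a blocked UDP stream usually fails silently: SETUP and PLAY succeed and then no packets arrive.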

Figure 3. The same RTP video, three ways to carry it. UDP is fastest but firewall-blocked; TCP interleaving needs only the one RTSP connection; HTTP tunneling passes the strictest proxies. The cleaner the network, the higher up the table you can stay.

The deeper networking behind this choice — why UDP and TCP behave the way they do, and how NAT traversal really works — is streaming-transport territory, covered in TCP vs UDP for streaming and NAT, STUN, TURN, and ICE. Here the rule is enough: the cleaner the network, the higher up Table 2 you can stay.

What one camera actually puts on the wire

Tie it together with arithmetic for a single camera, because the numbers make the abstractions concrete. Take a 4-megapixel camera recording continuously in H.265 at an average of 3 Mbps (about half what the same camera would need in H.264). The math runs in three short steps.

First, turn the bitrate into bytes per second:

3 Mbps ÷ 8 = 0.375 megabytes per second = 375,000 bytes/sec

Second, turn bytes into packets, at roughly 1,400 bytes of payload each:

375,000 ÷ 1,400 ≈ 268 RTP packets per second

Third, add the RTCP allowance, which the standard budgets at 5% of the media:

3 Mbps × 5% ≈ 0.15 Mbps for RTCP at most, a rounding error

So one ordinary camera emits on the order of 268 video packets every second, each with its own sequence number, plus a trickle of RTCP reports every few seconds. Multiply across a 40-camera site and the VMS is reassembling roughly 10,700 packets per second into coherent frames, watching every sequence number for gaps, which is why ingest performance under load, not on a bench, is the real test of a surveillance platform. The bitrate itself, and therefore the storage bill, since a VMS records the stream as-is, is governed by the codec on the camera, the subject of how surveillance storage works.
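The three steps, and the multiplication across a site, fit in one small function. This is a back-of-the-envelope sketch using the article's own figures (3 Mbps, about 1,400 payload bytes per packet, a 5% RTCP budget); the function name and its parameters are invented for this example.

```python
def wire_load(bitrate_mbps: float, cameras: int = 1,
              payload_bytes: int = 1400, rtcp_fraction: float = 0.05) -> dict:
    """Estimate bytes/sec, RTP packets/sec, and the RTCP budget for a group of cameras."""
    bytes_per_sec = bitrate_mbps * 1_000_000 / 8 * cameras
    return {
        "bytes_per_sec": bytes_per_sec,
        "packets_per_sec": round(bytes_per_sec / payload_bytes),
        "rtcp_budget_mbps": bitrate_mbps * rtcp_fraction * cameras,
    }


print(wire_load(3))               # one camera
print(wire_load(3, cameras=40))   # a 40-camera site
```

The packet rate is an average; keyframes arrive as bursts of dozens of packets, so the peak rate an ingest server must absorb is higher than this figure.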

A common mistake to avoid

The costliest pattern we see is treating the transport as plug-and-play and discovering the gaps in production, and it has four faces. First, leaving the stream on UDP across a firewall, NAT, or the internet: it connects on the bench and fails on site — move to TCP interleaved, or HTTP tunneling, for any link that crosses a firewall. Second, believing "ONVIF" replaces RTSP: ONVIF only sets up the call, and the video still rides RTSP/RTP, so an ONVIF problem and a streaming problem are different problems with different fixes. Third, ignoring RTCP: the loss and jitter reports are the early-warning system, and a VMS that does not surface them is flying blind. Fourth, assuming a lost packet costs one frame: on UDP, losing one fragment of a keyframe can corrupt seconds of video, which is why hostile networks need TCP's reliability despite its latency. None of these is exotic; all four are predictable, and all four are cheaper to design around than to debug after deployment.

Where Fora Soft fits in

Fora Soft has built real-time video, streaming, and computer-vision software since 2005, across 625+ shipped projects, and the wire is where surveillance products quietly succeed or fail under load. The hard part is never one camera on a clean bench; it is a few hundred cameras from several manufacturers, some textbook-conformant and some that mishandle a TEARDOWN or stop sending RTCP, all of which must stream reliably, reconnect after a network blip, and degrade gracefully when packets drop. We build that layer — the RTSP/RTP pull with an automatic UDP-to-TCP-to-HTTP fallback, sequence-number-aware reassembly, and RTCP-driven health monitoring that flags a degrading camera before an operator sees a frozen tile. We lead with how the pipeline behaves on the worst network day — the packet storm, the half-open session — then the feature list, because an ingest layer that survives a bad network beats one that demos well on a quiet LAN.

References

  1. IETF — "RFC 2326: Real Time Streaming Protocol (RTSP)" (RTSP "acts as a 'network remote control' for multimedia servers", §1.1; required methods OPTIONS/SETUP/PLAY/TEARDOWN, §10; default port 554 over TCP or UDP, §3.2; stateful, unlike HTTP, §1.1; interleaved binary framing — $ (24 hex) + one-byte channel + two-byte length, §10.12; client/server state machine INIT/READY/PLAYING, Appendix A). Primary (tier 1). https://www.rfc-editor.org/rfc/rfc2326.txt
  2. IETF — "RFC 3550: RTP, A Transport Protocol for Real-Time Applications" (fixed 12-byte header, §5.1; sequence number "increments by one for each RTP data packet sent... to detect packet loss and to restore packet sequence"; timestamp "reflects the sampling instant of the first octet"; SSRC synchronization source; five RTCP packet types SR=200/RR=201/SDES=202/BYE=203/APP=204; "the fraction of the session bandwidth added for RTCP be fixed at 5%", §6.2; "RECOMMENDED value for a fixed minimum interval is 5 seconds", §6.2). Primary (tier 1). https://www.rfc-editor.org/rfc/rfc3550.html
  3. IETF — "RFC 7826: Real Time Streaming Protocol Version 2.0" (RTSP 2.0, December 2016; "obsoletes RTSP version 1.0 defined in RFC 2326"; Internet Standards Track; most field cameras still implement RTSP 1.0). Primary (tier 1). https://www.rfc-editor.org/rfc/rfc7826.html
  4. IETF — "RFC 7798: RTP Payload Format for High Efficiency Video Coding (HEVC)" (March 2016; three packetization structures — single NAL unit packet, aggregation packet (AP), fragmentation unit (FU); APs bundle small NAL units, FUs fragment a NAL unit larger than the MTU across packets). Primary (tier 1). https://www.rfc-editor.org/rfc/rfc7798.txt
  5. IETF — "RFC 6184: RTP Payload Format for H.264 Video" (May 2011; analogous packetization — single NAL unit packet, aggregation packets STAP/MTAP, fragmentation units FU-A/FU-B). Primary (tier 1). https://www.rfc-editor.org/rfc/rfc6184.txt
  6. ONVIF — "ONVIF Streaming Specification" (media streaming options "all based on RTP [RFC 3550]"; "media control is accomplished over RTSP as defined in RFC 2326"; transport options include RTP/UDP, RTP/RTSP/TCP, and RTP/RTSP/HTTP/TCP; RTCP usage mandated). Primary (tier 1). https://www.onvif.org/specs/stream/ONVIF-Streaming-Spec.pdf
  7. ONVIF — "Profile T" (advanced video streaming; H.264 and H.265 encoding, HTTPS streaming; the 2026 streaming profile that adds HEVC). Primary (tier 1). https://www.onvif.org/profiles/profile-t/
  8. ITU-T — "H.265: High Efficiency Video Coding (HEVC)" (the codec standard whose compressed bitstream the RTP payload carries; ITU-T H.264/AVC its predecessor). Primary (tier 1). https://www.itu.int/rec/T-REC-H.265
  9. OpenEye — "How Does Adding a Camera as ONVIF Differ From RTSP" (operational view: ONVIF discovers and configures the device while RTSP is the stream the VMS plays; adding by RTSP forgoes ONVIF discovery, events, and configuration). Institutional / engineering orientation (tier 5). https://answers.openeye.net/Troubleshooting/Frequently_Asked_Questions/Cameras/How_Does_Adding_a_Camera_as_ONVIF_Differ_From_RTSP
  10. Wowza — "RTP vs RTSP: control and transport layers" (RTSP is the control channel; RTP/RTCP carry media on dynamically negotiated ports; TCP interleaving for firewall traversal; UDP-first with TCP fallback). Vendor engineering (tier 4). https://www.wowza.com/blog/rtp-vs-rtsp-the-difference-between-streamings-control-and-transport-layers