Why this matters
Judder and stutter are the artifacts that a still frame will never reveal and a per-frame quality score will happily pass. You can ship a clip with a flawless mean VMAF and a panning shot that stutters badly enough to generate support tickets, because the number never looked at when the frames played, only at what was in them. This article is for a video engineer, streaming lead, or QA engineer who has watched a render judder on a pan, or seen a video call hitch, and wants to know exactly what is happening in the time dimension, why their objective metric stayed silent, and which tools actually measure it. Get this right and you stop trusting a single picture-quality number to certify motion it was never built to see.
What judder and stutter look like
Every artifact in this gallery so far — blocking, banding, ringing, blur — lives inside a single frame. Pause the video and you can point at it. Temporal artifacts are different: they exist only between frames, in motion, and they vanish the instant you hit pause. That is exactly why they are so easy to miss and so hard to measure.
Judder is uneven motion caused by frames being displayed for unequal lengths of time. Picture a camera panning smoothly across a skyline. If every frame is held on screen for the same number of milliseconds, the motion looks fluid. If some frames linger and others flash past, the same pan develops a recurring stutter-step — a subtle "tug" in the motion that repeats several times a second. The content is moving at a constant speed; the display timing is not, and your eye, which tracks the motion smoothly, sees the mismatch as jerkiness.
Stutter is the related hitch you get when a frame is missing or repeated. Where judder is a regular, repeating unevenness baked into a cadence, stutter is an irregular lurch: a frame the player should have shown on time arrives late, gets dropped, or is held for an extra beat, and the motion jumps or briefly freezes. A common sign is an object that seems to hesitate, or even appears to jump back, for a single beat before catching up.
A third member of the family runs the opposite way. The soap-opera effect is the unnaturally smooth, "video-like" motion a TV produces when it invents extra frames by motion interpolation to fill a high-refresh panel. It is not a loss of frames but a synthesis of fake ones, and it brings its own artifacts — we return to it below. All three share one trait that defines this article: they are about the timing and number of frames, not the pixels inside any one of them.
Figure 1. What judder is. The same pan, sampled at a constant 24 frames per second, displayed on a 60 Hz screen. Because 60 does not divide evenly by 24, the 3:2 cadence holds one frame for three refreshes (50 ms) and the next for two (33 ms). Equal motion, unequal display time — the eye reads the difference as judder.
Where judder comes from: frame-rate conversion
Judder is, almost always, the fingerprint of a frame-rate conversion that does not divide evenly. The classic case is showing 24-frames-per-second cinema on a 60 Hz screen.
Start with the arithmetic, because it is the whole story. A 60 Hz display refreshes 60 times a second, once every 1000 ÷ 60 ≈ 16.67 milliseconds. Film runs at 24 frames per second. To play 24 frames across 60 refreshes you would need each frame to occupy 60 ÷ 24 = 2.5 refreshes — and you cannot show a frame for half a refresh. So the system rounds, alternately, in a pattern called 3:2 pulldown (or 2:3 pulldown): hold the first frame for 3 refreshes, the next for 2, the next for 3, the next for 2, and so on. Three refreshes is 3 × 16.67 = 50 milliseconds; two refreshes is 2 × 16.67 = 33.3 milliseconds. Over each pair of frames the motion is displayed slightly slow, then slightly fast — and that 50-vs-33 unevenness, repeating about twelve times a second, is judder.
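The arithmetic above can be checked with a minimal sketch. It assumes the simplest possible display model, in which each refresh shows the latest source frame whose time has arrived; the function names `hold_pattern` and `hold_times_ms` are illustrative, not part of any player or library.

```python
def hold_pattern(source_fps: int, display_hz: int, n_frames: int) -> list[int]:
    """Count how many display refreshes each source frame stays on screen.

    Simplified model (an assumption): at every refresh the display shows the
    latest source frame whose presentation time has already arrived.
    """
    holds = [0] * n_frames
    refresh = 0
    while True:
        # Integer math avoids floating-point drift: floor(refresh * fps / hz).
        frame_index = refresh * source_fps // display_hz
        if frame_index >= n_frames:
            break
        holds[frame_index] += 1
        refresh += 1
    return holds


def hold_times_ms(holds: list[int], display_hz: int) -> list[float]:
    """Convert refresh counts into on-screen durations in milliseconds."""
    return [round(h * 1000 / display_hz, 2) for h in holds]


film_on_60 = hold_pattern(24, 60, 4)
print(film_on_60, hold_times_ms(film_on_60, 60))  # 3-2-3-2 cadence: 50 ms vs 33.33 ms
print(hold_pattern(24, 120, 4))                   # 120 / 24 = 5, so every hold is equal
print(hold_pattern(30, 60, 4))                    # 60 / 30 = 2, also even
```

Only the 24-on-60 case produces unequal holds; whenever the display rate is a whole multiple of the source rate, every frame gets the same screen time.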
The name comes from the original analog process. Converting 24 fps film to the 29.97 fps of NTSC television, a telecine first slows the film by one part in a thousand to 23.976 fps (imperceptible: a two-hour film runs about 7 seconds long), then distributes every four film frames across five interlaced video frames, ten fields, in a 3-2-3-2 field cadence (Poynton, Digital Video and HDTV, 2003). Four frames in, five frames out. The uneven cadence was acceptable on a tube TV and has haunted every 24-into-60 conversion since.
There is a cleaner-looking conversion that trades the judder for a different fault. In PAL regions, 24 fps film is often shown at 25 fps by simply running it about 4% faster (25 ÷ 24 ≈ 1.0417), so that each film frame maps to exactly one video frame, two fields, in a 2:2 cadence. The motion is even, with no judder, but the whole film plays 4% short, and the audio pitch rises by about 0.71 of a semitone (12 × log2(25 ÷ 24) ≈ 0.707) unless it is pitch-corrected. This is the well-known "PAL speedup." It is the same lesson the other way around: when the source and display frame rates do not share a clean ratio, something has to give, whether even cadence, correct runtime, or correct pitch. You pick which.
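The runtime and pitch consequences of a speed change follow from one ratio. The sketch below works them through for a two-hour film, using the standard equal-temperament conversion of 12 semitones per doubling of frequency; the function name `speed_change_effects` is illustrative, and the pitch figure assumes the audio is not pitch-corrected.

```python
import math


def speed_change_effects(source_fps: float, playback_fps: float, runtime_s: float):
    """Return (speed ratio, new runtime in seconds, pitch shift in semitones)."""
    ratio = playback_fps / source_fps
    new_runtime_s = runtime_s / ratio
    # Equal temperament: 12 semitones per doubling of frequency.
    semitones = 12 * math.log2(ratio)
    return ratio, new_runtime_s, semitones


two_hours = 2 * 3600

# PAL speedup: 24 fps film played at 25 fps.
ratio, runtime, pitch = speed_change_effects(24, 25, two_hours)
print(f"PAL:  x{ratio:.4f}, {two_hours - runtime:.0f} s shorter, {pitch:+.2f} semitones")

# NTSC slowdown: 24 fps film played at 24000/1001 (about 23.976) fps.
ratio, runtime, pitch = speed_change_effects(24, 24000 / 1001, two_hours)
print(f"NTSC: x{ratio:.4f}, {runtime - two_hours:.1f} s longer, {pitch:+.3f} semitones")
```

Under these assumptions the PAL case shortens a two-hour film by 288 seconds and raises pitch by roughly 0.71 of a semitone, while the NTSC slowdown adds about 7.2 seconds and shifts pitch by less than two hundredths of a semitone.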
Figure 2. The 3:2 pulldown cadence. Four 24 fps film frames (A, B, C, D) are shown across ten refreshes of a 60 Hz screen by holding A for 3 refreshes, B for 2, C for 3, D for 2. The repeating 3-2 pattern is what makes equal-speed motion arrive at the eye unevenly. (The classic NTSC telecine packs the same cadence into five interlaced video frames, ten fields.)
Where stutter comes from: dropped and late frames
Judder is designed-in unevenness from a frame-rate ratio. Stutter is unplanned unevenness from a frame that did not show up on time.
It happens wherever the pipeline has to keep a real-time beat. A camera or capture card under load drops frames at the source, so the recording itself has gaps. An encoder configured for a frame rate the hardware cannot sustain duplicates or drops frames to keep up. On the playback side, a player on a busy device misses a display deadline — heavy rendering, an underpowered decoder, a browser tab starved of CPU — and the previous frame is held for an extra refresh while the late one is skipped. The result is the same in every case: motion that should advance by one step instead stalls and then jumps two, a visible hitch.
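A hitch of this kind shows up directly in presentation timestamps. As a minimal sketch, assuming you can log the time at which each frame actually reached the screen (the `find_hitches` helper and its `tolerance` parameter are hypothetical, not an API of any player):

```python
def find_hitches(timestamps_ms, nominal_fps, tolerance=0.5):
    """Return (index, interval_ms) for each frame shown noticeably late.

    A frame counts as a hitch when the gap since the previous frame exceeds
    the nominal frame interval by more than `tolerance` (a fraction of one
    interval). The 0.5 default is an arbitrary illustrative threshold.
    """
    nominal_ms = 1000.0 / nominal_fps
    hitches = []
    for i in range(1, len(timestamps_ms)):
        interval = timestamps_ms[i] - timestamps_ms[i - 1]
        if interval > nominal_ms * (1 + tolerance):
            hitches.append((i, interval))
    return hitches


# Ten frames at 60 fps with frame 5 dropped: motion stalls, then jumps two steps.
smooth = [i * 1000 / 60 for i in range(10)]
dropped = [t for i, t in enumerate(smooth) if i != 5]

print(find_hitches(smooth, 60))   # no hitches
print(find_hitches(dropped, 60))  # one interval of about 33.3 ms instead of 16.7 ms
```

The threshold is a judgment call: set it too tight and ordinary timing jitter is flagged, too loose and a single missed refresh slips through.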
It is worth drawing a sharp line here, because the two get confused. Stutter is not rebuffering. Rebuffering is when the player's buffer runs empty and playback stalls — the spinner, a full stop — because no data arrived. Stutter can happen with a full buffer and a perfect network: the frames are all present, but one is rendered late or dropped, so motion hitches without ever stopping. They are separate dimensions of the viewing experience, measured separately, and the delivery side of that story — stalls, startup time, the buffer — belongs to streaming QoE. Here we care about the picture's motion, not the pipe that fed it.
The soap-opera effect: the same dial, turned the other way
If judder and stutter are too few frames shown unevenly, the soap-opera effect is too many frames, invented. To drive a 120 Hz panel from a 24 fps source, a TV's motion-interpolation feature estimates how objects move between two real frames and synthesizes in-between frames to fill the gap. Done well, motion looks impossibly smooth; done at all, it strips the 24 fps cadence that audiences have associated with "cinematic" for a century, which is why it looks like a daytime soap to many viewers.
Interpolation also fails in characteristic ways. When two objects cross, or something moves too fast for the motion estimator, the invented frame guesses wrong: you get a torn or smeared region, a faint halo around a moving object, or a flickering patch where the algorithm could not decide. These are artifacts of manufactured frames, the mirror image of the dropped-frame stutter.
This is also where the 24 fps "look" earns its keep, and where motion blur enters. Film shot with a 180-degree shutter exposes each frame for half the frame interval — 1/48 of a second at 24 fps — so fast motion records as a blur that bridges the gap between one frame's position and the next. That blur is what makes 24 fps watchable: it smooths the jumps. Shoot the same 24 fps with a very fast shutter and crisp, blur-free frames, and a fast pan judders hard, because nothing fills the gaps between the large positional jumps. The cause-side mechanics of frame rate, shutter, and motion blur live in the Video Encoding section; here the point is only that intentional 24 fps cadence is a creative choice a metric must not "correct," and that motion blur and judder trade off against each other.
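The shutter arithmetic can be made concrete with a small sketch. The pan speed of 960 pixels per second is an arbitrary illustrative value, and both function names are hypothetical.

```python
def exposure_time_s(shutter_angle_deg, fps):
    """Exposure per frame: the shutter is open for angle/360 of the frame interval."""
    return (shutter_angle_deg / 360.0) / fps


def pan_jump_and_blur_px(pan_speed_px_s, fps, shutter_angle_deg):
    """Return (positional jump between frames, blur length within a frame) in pixels."""
    jump_px = pan_speed_px_s / fps
    blur_px = pan_speed_px_s * exposure_time_s(shutter_angle_deg, fps)
    return jump_px, blur_px


print(exposure_time_s(180, 24))             # 1/48 s, about 0.0208
print(pan_jump_and_blur_px(960, 24, 180))   # blur covers half of each 40 px jump
print(pan_jump_and_blur_px(960, 24, 45))    # fast shutter: blur covers an eighth of it
```

With a 180-degree shutter the blur spans half the distance the image moves between frames; at 45 degrees it spans only an eighth, which is the unbridged gap the eye reads as harder judder.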
Why a per-frame metric is blind to all of this
Here is the measurement heart of the article, and the reason temporal artifacts get their own piece. The objective metrics most teams rely on are per-frame metrics: they score each frame on its own and then average the scores. Judder and stutter are faults in time, not in any single frame — so these metrics are structurally blind to them.
Walk through why. The number that compares a compressed frame to the original pixel by pixel, called PSNR (Peak Signal-to-Noise Ratio, in decibels), takes two frames and returns one value. The structural metric, SSIM (Structural Similarity, 0–1), does the same. Both are defined on a single pair of images; neither has any notion of how long a frame stayed on screen or whether one was dropped. Run 3:2 pulldown judder through them and the pixels of every frame are still exactly right — the judder is in the display timing, which the metric never receives — so they report perfect quality.
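The blind spot is visible in the function signature itself. Here is a minimal sketch of mean per-frame PSNR on tiny made-up frames (flat lists of luma values); it is a textbook PSNR, not any particular tool's implementation, and the hold-time lists are hypothetical display timings.

```python
import math


def psnr(ref, dist, peak=255.0):
    """PSNR in dB between two equally sized frames given as flat lists of luma values."""
    mse = sum((r - d) ** 2 for r, d in zip(ref, dist)) / len(ref)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


def mean_psnr(ref_frames, dist_frames):
    """Score each frame pair on its own, then average: the usual per-frame pooling."""
    scores = [psnr(r, d) for r, d in zip(ref_frames, dist_frames)]
    return sum(scores) / len(scores)


ref = [[10, 20, 30, 40], [11, 21, 31, 41], [12, 22, 32, 42], [13, 23, 33, 43]]
dist = [[p + 1 for p in frame] for frame in ref]  # small, uniform coding error

even_holds_ms = [41.7, 41.7, 41.7, 41.7]      # smooth 24 fps presentation
pulldown_holds_ms = [50.0, 33.3, 50.0, 33.3]  # 3:2 cadence on a 60 Hz display

# Neither list of hold times can be passed in: the metric only accepts pixels.
print(round(mean_psnr(ref, dist), 2))
```

Whichever of the two presentations the viewer actually saw, the score is the same, because display timing is not an input.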
VMAF — Netflix's fused perceptual metric (Video Multimethod Assessment Fusion, 0–100) — is the one most likely to be trusted here, and it does carry a temporal feature, so it is worth being precise about what that feature does. VMAF fuses three core features: VIF and ADM (both spatial, detail-and-structure features) and one temporal feature called Motion2. Netflix's own documentation describes Motion2 plainly: "a simple measure of the temporal difference between adjacent frames... the average absolute pixel difference for the luminance component." Read that carefully. It measures how much the content moves — high for a fast pan, low for a static shot — not whether the motion was displayed evenly. It is a descriptor of the scene's motion, not a detector of temporal distortion. A juddering pan and a smooth pan with the same content have nearly the same Motion2.
Motion2 is close kin to what ITU-T P.910, the standard for subjective video methods, calls Temporal Information (TI): the standard deviation over the frame of the pixel-by-pixel difference between successive frames, "a measure that indicates the number of temporal changes of a video sequence... higher for high motion sequences" (ITU-T P.910, 2023). One averages the absolute frame difference and the other takes its standard deviation, so TI and Motion2 are cousins rather than the same quantity, and both characterize content, not impairment. There is no widely deployed per-frame metric whose temporal term rises specifically because the cadence broke.
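Both descriptors reduce to a statistic of the frame difference, which a toy example makes plain. This is a simplified sketch only: the published definitions add steps omitted here (such as pre-filtering and pooling over time), and the one-dimensional `bar` "frames" are a made-up stand-in for real luma planes.

```python
import statistics


def bar(pos, width=8, length=32):
    """A one-row 'frame': a bright bar of `width` pixels starting at `pos`."""
    return [255 if pos <= i < pos + width else 0 for i in range(length)]


def frame_diff(prev, curr):
    return [c - p for p, c in zip(prev, curr)]


def mean_abs_diff(prev, curr):
    """Motion-style descriptor: average absolute luma difference between frames."""
    d = frame_diff(prev, curr)
    return sum(abs(x) for x in d) / len(d)


def ti_like(prev, curr):
    """TI-style descriptor: standard deviation over the frame of the difference."""
    return statistics.pstdev(frame_diff(prev, curr))


for label, shift in [("static", 0), ("slow pan", 1), ("fast pan", 4)]:
    prev, curr = bar(4), bar(4 + shift)
    print(label, mean_abs_diff(prev, curr), ti_like(prev, curr))
```

Both values climb as the bar moves further per frame. Neither takes a timestamp, so neither can tell whether those frames were displayed evenly.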
There is a second, quieter reason even a stutter that does register gets lost: pooling. A dropped frame creates real pixel error in the handful of frames around it — but that error is then averaged across hundreds of correct frames, and the mean barely moves. A 10-second, 60 fps clip is 600 frames; one ugly hitch is a fraction of a percent of the average. The pooling step that turns per-frame scores into one number dilutes the very moment a viewer would complain about. This is the same lesson that runs through where objective metrics lie: a single mean number is a summary, and temporal artifacts are exactly the kind of localized, time-based event a summary erases.
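The dilution is easy to reproduce with invented numbers. In the sketch below, the per-frame scores (96 for clean frames, 40 for three frames around a hitch) are made up purely for illustration, and `worst_fraction_mean` is a hypothetical alternative pooling, not a standard one.

```python
import statistics


def worst_fraction_mean(scores, fraction=0.01):
    """Average of the lowest-scoring `fraction` of frames (at least one frame)."""
    k = max(1, int(len(scores) * fraction))
    return statistics.fmean(sorted(scores)[:k])


# 10 seconds at 60 fps: 600 frames, three of them damaged around one hitch.
scores = [96.0] * 600
for i in (300, 301, 302):
    scores[i] = 40.0

print(round(statistics.fmean(scores), 2))  # the mean barely moves
print(min(scores))                         # the minimum sees the hitch
print(worst_fraction_mean(scores))         # so does the mean of the worst 1%
```

With these made-up scores the mean drops from 96 to 95.72, while the minimum and the worst-1% average fall to 40 and 68. Reporting a low-percentile or worst-case figure alongside the mean keeps a localized event from being averaged away.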
Common mistake: certifying motion with a per-frame mean. The expensive error is to run a panning or high-motion clip through PSNR or VMAF, see a high mean score, and sign off on the motion. The score validated the pixels of each frame, not the cadence they played at. A clip can read mean VMAF 96 and judder visibly on every pan. If frame rate, frame-rate conversion, or playback smoothness is in scope, do not let a per-frame metric speak for it: use a frame-rate-aware metric, inspect the cadence, and — for anything shipping — watch the motion. And never compare two clips at different frame rates on a per-frame metric as if the numbers were on one scale; they are not.
How to actually measure temporal quality
If the everyday metrics are blind to time, what does see it? Four honest options, in rough order of rigor.
Frame-rate-aware objective metrics. A small family of metrics was built specifically to score quality as frame rate changes. FRQM, the Frame Rate dependent Quality Metric (Zhang, Mackin, and Bull, ICIP 2017), applies a temporal wavelet decomposition to compare a low-frame-rate clip against its high-frame-rate reference, predicting the perceptual cost of the frame-rate reduction; it was validated on the BVI-HFR database up to 120 fps. ST-GREED, the Space-Time Generalized Entropic Difference (Madhusudana et al., IEEE TIP, 2021), models the statistics of spatial and temporal band-pass coefficients and compares their entropy between reference and distorted video, capturing quality changes that arise specifically from frame-rate differences; its authors reported state-of-the-art correlation on the LIVE-YT-HFR database at the time of publication. These need a reference, but unlike PSNR/SSIM/VMAF they were designed to react to the time dimension.
Standardized streaming models. For adaptive streaming, ITU-T P.1204 — the model series for video quality of streaming up to 4K — explicitly handles frame rate. Its feature set includes windowed frame-rate measures and repeated-frame indicators, and the models are trained across frame rates from 15 to 60 fps (ITU-T P.1204, 2023). When a delivery chain changes frame rate, a P.1204-class model is built to account for it where a picture-only metric is not.
No-reference cadence and drop detection. On live, user-generated, or already-delivered content there is no pristine reference, so you measure the timeline directly: detect a 3:2 (or 2:2) pulldown cadence from the pattern of repeated frames, flag duplicated and dropped frames, and measure the inter-frame interval to find where display timing went uneven. This is cheap, runs without a reference, and tells you that the cadence is irregular — though, like every blind measure in this gallery, it cannot tell an intentional cadence from a fault.
Subjective testing — the ground truth. Because no objective metric fully captures perceived smoothness, motion quality is ultimately settled the way all quality is: by a properly run subjective test, with real viewers watching real motion under controlled conditions. When a frame-rate-aware metric and the eye disagree, the eye wins and the metric is the proxy that failed.
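The no-reference cadence check described above can be sketched in a few lines. This assumes each delivered frame has already been reduced to a label in which equal labels mean "same picture"; a real detector would need a pixel-difference threshold, since compression noise means repeated frames are rarely bit-identical. The `classify_cadence` helper and its return strings are hypothetical.

```python
from itertools import groupby


def run_lengths(frame_ids):
    """Length of each run of consecutive identical frames."""
    return [len(list(group)) for _, group in groupby(frame_ids)]


def classify_cadence(frame_ids):
    """Classify the repeat pattern of a delivered frame sequence."""
    # The first and last runs may be cut off by where capture started and ended.
    runs = run_lengths(frame_ids)[1:-1]
    if len(runs) < 4:
        return "unknown"
    if len(set(runs)) == 1:
        return "even"
    alternating = all(a != b for a, b in zip(runs, runs[1:]))
    if set(runs) == {2, 3} and alternating:
        return "3:2 pulldown"
    return "irregular"


print(classify_cadence("AAABBCCCDDEEEFF"))   # 3:2 pulldown
print(classify_cadence("AABBCCDDEEFF"))      # even
print(classify_cadence("AAABBCCCCDDEEEFF"))  # irregular: one frame held an extra beat
```

The classifier reports what the cadence is, not whether it was intended: a "3:2 pulldown" result on a film title may be exactly what the distributor chose.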
| Metric / method | What it measures | Reference needed | Where it lies on judder and stutter |
|---|---|---|---|
| PSNR | Mean pixel error per frame (dB) | Full-reference | Blind to timing — frames are pixel-correct under pulldown judder, so it reports no error |
| SSIM | Structural similarity per frame (0–1) | Full-reference | Same blind spot; single-frame by construction, no temporal term |
| VMAF (Motion2) | Fused perceptual score (0–100) | Full-reference | Its one temporal feature measures content motion, not cadence; mean-pooling further dilutes a stutter |
| FRQM / ST-GREED | Frame-rate-dependent quality | Full-reference | Built for frame-rate change; needs a high-frame-rate reference and lab validation |
| ITU-T P.1204 | Streaming MOS incl. frame rate | Bitstream/hybrid | Models windowed frame rate and repeated frames; scoped to its trained codecs and 15–60 fps |
| No-reference cadence/drop check | Pulldown pattern, dropped/repeated frames | No-reference | Flags irregular timing on live/UGC; cannot tell an intentional cadence from a fault |
Table 1. Six ways to treat temporal artifacts. The everyday per-frame metrics (PSNR, SSIM, VMAF) are blind to display timing because judder leaves each frame's pixels correct. The metrics that see frame rate (FRQM, ST-GREED, P.1204) were purpose-built for it; a no-reference cadence check works without a reference but cannot read intent.
Figure 3. The temporal blind spot. A per-frame metric scores the pixels of each frame against the reference. Under pulldown judder the pixels are exactly right, so PSNR/SSIM/VMAF read "excellent"; VMAF's Motion2 feature only measures how much the content moves. The fault — uneven display timing — is in a dimension these metrics never receive.
Figure 4. The temporal-artifact family. Judder is too few frames held unevenly by a frame-rate ratio; stutter is a frame dropped or delayed in the pipeline; the soap-opera effect is extra frames invented by interpolation. All three are faults of frame timing and count, the dimension per-frame metrics do not model.
Where Fora Soft fits in
Fora Soft has built video software since 2005 — streaming, WebRTC conferencing, e-learning, OTT, telemedicine, and surveillance — and motion smoothness is the quality dimension that varies most by product. A WebRTC video call that stutters under network jitter fails differently from an OTT title that judders because a 24 fps master met a 60 Hz pipeline; a surveillance stream that silently drops frames can lose the one second that mattered. We treat temporal quality as its own measurement problem, separate from picture quality: per-frame metrics certify the frames, but cadence checks, dropped-frame counts, and frame-interval timing certify the motion, and on the delivery side we keep stutter (a rendering fault) distinct from rebuffering (an empty buffer). The fixes follow the cause — a clean frame-rate ratio end to end, enough headroom that the player never misses a deadline, and resisting the urge to "smooth" motion the creator shot at 24 fps on purpose.
What to read next
- Pooling: Turning Per-Frame Scores Into One Number
- Where Objective Metrics Lie: Content, Motion, and Edge Cases
- Why Subjective Testing Is the Ground Truth
Call to action
- Talk to a video engineer — book a 30-minute scoping call to talk through your judder plan.
- See our case studies — 250+ shipped projects across video streaming, WebRTC, OTT, telemedicine, e-learning, surveillance, and AR/VR.
References
- Recommendation ITU-T P.910, "Subjective video quality assessment methods for multimedia applications," International Telecommunication Union, 2023 (and prior editions). Tier 1 (official standard). Defines Temporal Information (TI) as the standard deviation over the frame of the successive-frame luminance difference — a measure of the amount of motion in the content, not of temporal distortion. Basis for the claim that the standard "motion" descriptor characterizes content, not impairment. https://www.itu.int/rec/T-REC-P.910
- Recommendation ITU-T P.1204, "Video quality assessment of streaming services over reliable transport for resolutions up to 4K," International Telecommunication Union, 2023. Tier 1 (official standard). Defines objective streaming-quality models trained across 15–60 fps for H.264/HEVC/VP9, whose features include windowed frame-rate measures and repeated-frame indicators. Basis for the "standardized streaming models account for frame rate" point. https://www.itu.int/dms_pubrec/itu-t/rec/p/T-REC-P.1204-202310-I!!SUM-HTM-E.htm
- F. Zhang, A. Mackin, and D. R. Bull, "A frame rate dependent video quality metric based on temporal wavelet decomposition and spatiotemporal pooling," Proc. IEEE International Conference on Image Processing (ICIP), pp. 300–304, 2017. Tier 1 (metric-author defining work). Defines FRQM, which uses a temporal wavelet transform to predict the perceptual quality cost of frame-rate reduction; validated on the BVI-HFR database up to 120 fps. Basis for the frame-rate-aware-metric section and Table 1. https://ieeexplore.ieee.org/document/8296291
- P. C. Madhusudana, N. Birkbeck, Y. Wang, B. Adsumilli, and A. C. Bovik, "ST-GREED: Space-Time Generalized Entropic Differences for Frame Rate Dependent Video Quality Prediction," IEEE Transactions on Image Processing, vol. 30, pp. 7446–7457, 2021. Tier 1 (metric-author defining work). Models the entropy of spatial and temporal band-pass coefficients to capture quality changes from frame-rate variation; state of the art on the LIVE-YT-HFR database. Basis for the ST-GREED description and Table 1. https://doi.org/10.1109/TIP.2021.3106801
- VMAF Features (Netflix/vmaf, resource/doc/features.md), Netflix, accessed 2026-06-24. Tier 3 (first-party tooling / metric owner's documentation). States that VMAF's three core features are VIF, Motion2, and ADM, and that Motion2 is "a simple measure of the temporal difference between adjacent frames... the average absolute pixel difference for the luminance component." Basis for the precise description of what VMAF's temporal feature does and does not measure. https://github.com/Netflix/vmaf/blob/master/resource/doc/features.md
- C. Poynton, Digital Video and HDTV: Algorithms and Interfaces, Morgan Kaufmann, 2003 (p. 430). Tier 5 (foundational reference). The authoritative account of 2:3/3:2 pulldown: the 23.976 fps slowdown and the distribution of four film frames across five interlaced video frames (ten fields) in a 3-2-3-2 cadence. Basis for the telecine arithmetic and the judder cadence. https://archive.org/details/digitalvideohdtv0000poyn
- "Three-two pull down," Wikipedia, accessed 2026-06-24. Tier 6 (orientation). Summarizes the 24→29.97/59.94 conversion, the 23.976 fps slowdown (one part in a thousand), and the 3-2-3-2 field cadence with its "dirty" blended frames. Orientation for the pulldown mechanics; the normative cadence is cited to Poynton above. https://en.wikipedia.org/wiki/Three-two_pull_down
- A. Mackin, F. Zhang, and D. R. Bull, "A study of subjective video quality at various frame rates" / the BVI-HFR database, University of Bristol Visual Information Laboratory, 2017–2018. Tier 5 (peer-reviewed/institutional). Subjective high-frame-rate database (up to 120 fps) on which FRQM was validated; documents that perceived quality depends on frame rate in ways per-frame metrics miss. Basis for the high-frame-rate validation claims. https://vilab.blogs.bristol.ac.uk/2018/08/frqm-a-frame-rate-dependent-video-quality-metric/
- "Soap opera effect" and "Motion interpolation," Wikipedia, accessed 2026-06-24. Tier 6 (orientation). Summarize frame-rate up-conversion: a TV synthesizes intermediate frames by motion estimation, producing very smooth motion plus halo/ghosting artifacts around fast or occluding objects. Basis for the soap-opera-effect section. https://en.wikipedia.org/wiki/Motion_interpolation
- "The 180-degree shutter rule," DIYPhotography / PolarPro filmmaking references, accessed 2026-06-24. Tier 6 (orientation). Explains that a 180-degree shutter exposes each frame for half the frame interval (1/48 s at 24 fps), so motion blur bridges the gap between frames and reduces perceived judder; a fast shutter removes the blur and increases judder. Basis for the shutter/motion-blur trade-off. https://www.diyphotography.net/180-degree-shutter-rule/


