OpenPose, MediaPipe Pose, RTMPose, ViTPose — The Pose-Tracking Stack For Video In 2026

Why This Matters

If your product attaches a webcam or a CCTV stream to a body — a fitness app that grades a squat, a telemedicine consultation that measures range of motion, a sports analytics dashboard that compares two tennis serves, a surveillance system that flags a fall or a fight, an e-learning platform that scores a dance lesson, a retail analytics tool that counts how many shoppers reach for the top shelf, a kids-yoga app that turns the practice into a game — you will eventually need a skeleton overlay. Pose tracking is the technology that puts the seventeen, thirty-three, or one hundred and thirty-three dots on a person's body and connects them with the right line segments. The technology has been around since 2017, but the model choices, the deployment patterns, the latency budgets, and the privacy implications have all changed since 2024. The same problem that took a research GPU and a six-figure budget in 2018 — count squats inside a phone app — runs inside the browser tab today for free. The same problem that needed a custom-trained model in 2020 — flag a fall in a retirement-home corridor — runs on a fifty-dollar edge box today using a publicly available checkpoint. This article shows you which model to pick, where to run it, and how to avoid the four failure modes that sink ninety percent of production deployments.

What Pose Tracking Actually Is — And Why The Skeleton Matters

The body has joints. The head connects to the neck, the neck to the shoulders, the shoulders to the elbows, the elbows to the wrists. A pose-tracking model takes a webcam frame as input and emits the pixel coordinates of those joints, plus a confidence score for each one. The dots are called keypoints. The line segments that connect them are called the skeleton. The whole representation is a deliberate compression of the human body into a small, low-dimensional, structured object that downstream code can reason about cheaply.

Why bother compressing? Because the original pixels are heavy and the joint coordinates are light. A 1280 × 720 webcam frame is about three megabytes of RGB data (1280 × 720 × 3 bytes ≈ 2.8 MB). A pose estimate of seventeen keypoints in two dimensions is thirty-four floats, plus seventeen confidences: fifty-one 32-bit values, or about two hundred bytes. The skeleton is roughly ten thousand times smaller than the frame. That compression is what makes pose-based features practical: every downstream algorithm (fall detector, squat counter, gesture recognizer, sign-language translator, ergonomic-risk scorer) works on a few hundred bytes instead of three megabytes, and so runs cheaply, runs offline, and runs in the browser.
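The size comparison is easy to check with a few lines of arithmetic. The sketch below assumes 8-bit RGB pixels and 32-bit floats for every coordinate and confidence; both are our assumptions, since no storage format is fixed here.

```python
# Back-of-envelope size comparison: raw frame vs. 2D skeleton.
# Assumptions (ours): 8-bit RGB pixels, 32-bit floats per value.
FRAME_W, FRAME_H, CHANNELS = 1280, 720, 3
NUM_KEYPOINTS = 17          # COCO keypoint set
VALUES_PER_KEYPOINT = 3     # x, y, confidence
BYTES_PER_FLOAT = 4

frame_bytes = FRAME_W * FRAME_H * CHANNELS
skeleton_bytes = NUM_KEYPOINTS * VALUES_PER_KEYPOINT * BYTES_PER_FLOAT
ratio = frame_bytes / skeleton_bytes

print(f"frame:    {frame_bytes:,} bytes")
print(f"skeleton: {skeleton_bytes} bytes")
print(f"ratio:    {ratio:,.0f}x")
```

Under those assumptions the ratio lands a little above ten thousand, which is the order of magnitude that matters for the argument.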

The number of keypoints depends on the model family. The original COCO Keypoint benchmark, which the academic field has standardized around since 2017, defines seventeen keypoints — five on the head (nose, two eyes, two ears), six on the upper body (two shoulders, two elbows, two wrists), and six on the lower body (two hips, two knees, two ankles). MediaPipe extends that to thirty-three keypoints by adding hands, feet, and additional facial points. RTMPose's whole-body variant, RTMW, extends it further to one hundred and thirty-three keypoints by adding hand-finger joints and dense facial landmarks. More keypoints are not always better — they cost more compute, they make the model harder to train, and most downstream features only use eight to twelve of them. We will come back to this when we choose a model.
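As a concrete reference, here is the COCO seventeen written out as a plain Python list, grouped the way the paragraph above groups them. The index order follows the COCO annotation convention; the `SQUAT_KEYPOINTS` subset is a hypothetical example of a downstream feature that needs only a handful of points, not a recommendation from any library.

```python
# The 17 COCO keypoints in their conventional index order.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip", "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

# Grouping used in the text: head, upper body, lower body.
GROUPS = {
    "head": COCO_KEYPOINTS[0:5],
    "upper_body": COCO_KEYPOINTS[5:11],
    "lower_body": COCO_KEYPOINTS[11:17],
}

# Hypothetical subset for a squat counter: only the lower body is needed.
SQUAT_KEYPOINTS = [
    "left_hip", "right_hip", "left_knee",
    "right_knee", "left_ankle", "right_ankle",
]
SQUAT_INDICES = [COCO_KEYPOINTS.index(name) for name in SQUAT_KEYPOINTS]

for group, names in GROUPS.items():
    print(f"{group}: {len(names)} keypoints")
print("squat indices:", SQUAT_INDICES)
```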

The output of a pose model also splits along a second axis: two-dimensional versus three-dimensional. A two-dimensional keypoint lives at a pixel position (x, y) inside the input frame. A three-dimensional keypoint lives in a coordinate system anchored to the person's hip, with (x, y, z) measured in body-relative units; it tells you not just where the joint is on screen but how far it sticks out toward or away from the camera. Three-dimensional pose is what fitness apps need to grade a side-lunge that the camera sees mostly from the side, and what telemedicine apps need to measure shoulder abduction. Two-dimensional pose is enough for fall detection, gesture recognition, and surveillance — the difference matters less than people think.

Figure 1. Pose tracking is a compression step. The model turns three megabytes of pixels into a few hundred bytes of structured joint coordinates that downstream features can reason about cheaply. (Diagram: a webcam frame on the left and its compressed skeleton representation on the right.)

The Four Families Of Pose Model In 2026

Almost every production deployment in 2026 picks one of four model families. We walk through them in chronological order, because each one solved a problem that the next one inherited.

OpenPose — the 2017 system that defined the vocabulary

OpenPose was published at CVPR 2017 by Zhe Cao and colleagues at Carnegie Mellon University. It was the first system that could find every person in a frame, label every joint, and connect the joints into separate skeletons even when the people overlapped — what the field calls multi-person bottom-up pose estimation. The system worked by first running a heatmap network that, for each of seventeen joint types, produced a per-pixel probability that the pixel was that joint, and then running a second network that, for each pair of joints, produced a per-pixel two-dimensional vector — a part affinity field — pointing from one joint to the next along the limb. The decoder then assembled the joints into skeletons by walking the part affinity fields.

OpenPose is the model that taught the field its vocabulary. Almost every paper since 2017 cites it. The original implementation runs at 8 to 22 frames per second on an NVIDIA TITAN X GPU, depending on the input resolution, and the CVPR 2017 paper reports 61.8 percent average precision on the COCO keypoint test-dev set. The code lives at github.com/CMU-Perceptual-Computing-Lab/openpose and is still maintained.

In 2026, the practical reason to know about OpenPose is twofold. First, it is the model that the academic literature compares everything else against — when a 2024 paper says "we beat OpenPose by 12 points", that is the baseline they mean. Second, the OpenPose license is non-commercial — Carnegie Mellon distributes the model under terms that prohibit commercial use without a separately negotiated license. This is the single most important compliance fact about OpenPose, and it is missed often enough that we have had to rebuild three pose features for clients who shipped OpenPose into their commercial products without a license. Do not ship OpenPose in a commercial product without a Carnegie Mellon licensing agreement. Use any of the other three families instead. They are all permissively licensed.

MediaPipe Pose (BlazePose) — the mobile-first model that runs in the browser

MediaPipe Pose, also known as BlazePose, was published by Google Research in 2020. It was the first pose model designed for mobile from the first line of code — every architectural decision serves a single constraint: hit 30 frames per second on a Pixel-class phone. Where OpenPose runs a heavy heatmap network on the full frame, BlazePose runs a lightweight detector that first finds a single person, crops to that person's bounding box, and then runs a small regression network that emits thirty-three keypoints directly as (x, y, z, visibility) tuples. The two-stage design is the same pattern that the MediaPipe family uses for face detection and for hand tracking, and it scales gracefully from the Pixel 2 (above thirty frames per second on CPU) to the modern flagship (super-real-time on GPU).

BlazePose's thirty-three keypoint set extends the COCO seventeen by adding pinky, index, and thumb points on each hand, heel and foot-index (toe) points on each foot, and additional facial points around the eyes and mouth. The additional keypoints are what let MediaPipe Pose drive a fitness app: you can tell a half-squat from a full squat because the model can see the heel, and you can estimate the rough orientation of a hand because the model can see three points beyond the wrist.

The COCO accuracy of BlazePose is around 78 percent in the latest published variant, sitting below the GPU-class models but well above the threshold for production fitness, yoga, and gesture-recognition features. The model ships natively as a Tasks API in the @mediapipe/tasks-vision JavaScript package — the same package we recommended in the background-blur lesson — and runs in WebAssembly with WebGPU acceleration on Chrome, Edge, and Safari 17.4 and later.

In 2026, MediaPipe Pose is the default choice for any in-browser, in-app, or on-device feature. It is permissively licensed, it ships with first-party JavaScript bindings, and it does not need a server.

RTMPose — the OpenMMLab production workhorse

RTMPose was published in March 2023 by the OpenMMLab team at Shanghai AI Lab. The paper sets out to answer a single question: can a pose model reach state-of-the-art accuracy while running fast enough on a CPU to be deployed everywhere? The answer turned out to be yes, and the trick was to replace heatmaps with a different output representation. Where OpenPose and ViTPose emit a per-pixel heatmap per joint and then take the heatmap's argmax, RTMPose uses Simple Coordinate Classification (SimCC), which discretizes the image into two sets of one-dimensional bins — one for x, one for y — and asks the network to classify which bin each joint falls into. The output is then two soft probability distributions per joint, which the decoder turns into floating-point coordinates by taking a weighted sum.
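The decoding step described above can be sketched in a few lines. This is a minimal illustration of the weighted-sum idea, not the MMPose implementation: the function names are ours, the `split_ratio` of 2 bins per pixel is an assumption, and the confidence heuristic (the smaller of the two peak probabilities) is also an assumption.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_simcc(x_logits, y_logits, split_ratio=2.0):
    """Decode one joint from per-axis bin logits.

    Returns (x, y, confidence) in input-image pixels. `split_ratio` is the
    assumed number of bins per pixel along each axis.
    """
    px = softmax(x_logits)
    py = softmax(y_logits)
    # Weighted sum: the expected bin index under each distribution.
    x_bin = sum(i * p for i, p in enumerate(px))
    y_bin = sum(i * p for i, p in enumerate(py))
    # Assumed heuristic: confidence is the weaker of the two axis peaks.
    confidence = min(max(px), max(py))
    return x_bin / split_ratio, y_bin / split_ratio, confidence

# Synthetic logits for a 256 x 192 (height x width) input at 2 bins per pixel:
# 384 bins along x, 512 bins along y, each peaked at one bin.
x_logits = [-((i - 100) ** 2) / 8.0 for i in range(384)]
y_logits = [-((i - 200) ** 2) / 8.0 for i in range(512)]
x, y, conf = decode_simcc(x_logits, y_logits)
print(f"x={x:.2f}px y={y:.2f}px confidence={conf:.3f}")
```

Because the result is an expectation over bins, it is not limited to whole-bin positions, which is where the sub-pixel floating-point coordinates come from.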

The change matters because heatmaps are expensive: a 256 × 192 input frame becomes a 64 × 48 heatmap per joint, which is about three thousand floats per joint and tens of thousands per person, while a SimCC output is two one-dimensional vectors per joint, which is hundreds of floats. RTMPose-m, the medium variant, reaches 75.8 percent COCO average precision at 90+ frames per second on an Intel i7-11700 CPU and 430+ frames per second on an NVIDIA GTX 1660 Ti GPU. RTMPose-s, the small variant, hits 72.2 percent AP at 70+ frames per second on a Snapdragon 865 phone chip. For whole-body estimation, the RTMPose paper reports 67.0 percent AP for RTMPose-l on the COCO-WholeBody benchmark (which includes hand and face keypoints, 133 points total) at 130+ frames per second; RTMW, the later dedicated whole-body model, builds on that design.
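The output-size difference between the two representations can be counted directly. The sketch assumes a heatmap stride of 4 (which gives the 64 × 48 grid mentioned above) and a SimCC split ratio of 2 bins per pixel; the split ratio is our assumption for illustration.

```python
# Output floats per joint: 2D heatmap vs. SimCC 1D bins.
INPUT_H, INPUT_W = 256, 192
NUM_JOINTS = 17
HEATMAP_STRIDE = 4       # 256 x 192 input -> 64 x 48 heatmap
SIMCC_SPLIT_RATIO = 2    # assumed bins per pixel along each axis

heatmap_per_joint = (INPUT_H // HEATMAP_STRIDE) * (INPUT_W // HEATMAP_STRIDE)
simcc_per_joint = (INPUT_W + INPUT_H) * SIMCC_SPLIT_RATIO

print("heatmap floats per joint:", heatmap_per_joint)
print("SimCC floats per joint:  ", simcc_per_joint)
print("heatmap floats, 17 joints:", heatmap_per_joint * NUM_JOINTS)
print("SimCC floats, 17 joints:  ", simcc_per_joint * NUM_JOINTS)
```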

The numbers above are the reason RTMPose is the default choice in 2026 for any server-side or edge deployment that does not need a GPU. The model ships in the OpenMMLab MMPose framework under the Apache 2.0 license; the GitHub repository at github.com/open-mmlab/mmpose/tree/main/projects/rtmpose includes ONNX, TensorRT, and TorchScript exports. The same repository also ships a pretrained whole-body model (RTMW) and a three-dimensional model (RTMW3D).

ViTPose — the Vision Transformer baseline that crosses 80 points on COCO

ViTPose was published at NeurIPS 2022 by Yufei Xu and colleagues, with extensions in ViTPose++ in 2024. The model is deliberately simple: take a plain Vision Transformer — the same architecture you would use for image classification — pretrain it on the ImageNet-22K dataset, then attach a lightweight decoder that produces per-joint heatmaps. The architectural conservatism is the point: ViTPose set out to demonstrate that the structural priors of older convolutional pose models (multi-stage refinement, part affinity fields, dedicated heatmap heads) were not needed if the backbone was good enough.

It worked. The largest model in the original paper, ViTPose-G (built on a one-billion-parameter ViTAE-G backbone), reaches 80.9 percent AP on the MS COCO Keypoint test-dev set; an ensemble of ViTPose variants reaches 81.1 percent, which was the state of the art at publication and remains within a point of the best published numbers in 2026. The smaller variants are useful too: on COCO val, ViTPose-B reaches 75.8 AP, ViTPose-L reaches 78.3 AP, and ViTPose-H reaches 79.1 AP. We cover the Vision Transformer architecture itself in the upcoming ViT primer lesson; the short version is that the transformer's global self-attention captures long-range body-part dependencies that CNN receptive fields miss, which is exactly the property pose estimation needs.

ViTPose is the right pick when you have a GPU and you need every accuracy point you can get — sports analytics, broadcast contribution, medical motion-capture systems. The model is heavier than RTMPose and would not fit a CPU budget, but on a single A10 or L4 GPU it runs at 30 to 60 frames per second depending on the variant.

Figure 2. The four pose model families occupy four corners of the accuracy-speed-deployment space. The right pick depends on where the model has to run, not on which one has the highest paper number. (Diagram: the four families compared on three axes: accuracy on COCO Keypoint, throughput in frames per second, and target deployment topology.)

The Pipeline — Detection, Pose, Tracking, Smoothing

A production pose system is more than one model. The full pipeline has four stages, each of which can be skipped or simplified depending on the use case.

The first stage is person detection. Most production pose models — MediaPipe, RTMPose, ViTPose — are top-down: they expect a cropped image of one person and they emit a skeleton inside that crop. Before they can run, something has to find the people. The detector is a lightweight object detector trained to find the person class only; in 2026 the default choice is RTMDet-tiny from the same OpenMMLab project, or a YOLOv8-person variant from the YOLO production lineage. The detector outputs one bounding box per person; the pose model runs once per box. (OpenPose is the only mainstream bottom-up exception — it finds all joints first and then assembles them into people, so it does not need a separate detector.)

The second stage is the pose model itself — one of the four families above. It takes a person crop and emits keypoints.
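Because a top-down model emits keypoints in the coordinate system of the crop, the pipeline has to map them back into the full frame before tracking or drawing. A minimal sketch, assuming the crop was resized straight to the model input with no padding or rotation (real pipelines often pad to preserve aspect ratio); `crop_to_frame` and its parameter names are ours, not from any library.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # x1, y1, x2, y2 in frame pixels
Keypoint = Tuple[float, float, float]     # x, y, confidence

def crop_to_frame(keypoints: List[Keypoint], box: Box,
                  model_w: int = 192, model_h: int = 256) -> List[Keypoint]:
    """Map keypoints from model-input pixels back to full-frame pixels.

    Assumes the person box was stretched to (model_w, model_h) directly.
    """
    x1, y1, x2, y2 = box
    scale_x = (x2 - x1) / model_w
    scale_y = (y2 - y1) / model_h
    return [(x1 + kx * scale_x, y1 + ky * scale_y, conf)
            for kx, ky, conf in keypoints]

# One detected person occupying a 384 x 512 box whose corner is at (100, 50).
person_box = (100.0, 50.0, 484.0, 562.0)
crop_keypoints = [(96.0, 128.0, 0.9)]     # centre of the 192 x 256 crop
print(crop_to_frame(crop_keypoints, person_box))
```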

The third stage is tracking. A skeleton at frame 1 and a skeleton at frame 2 are unrelated objects unless something connects them — the multi-object tracker needs to assign the same track_id to the same person across frames so that downstream code can reason about a single trajectory. Modern systems use ByteTrack or OC-SORT, applied either to the person bounding boxes from stage 1 or to the keypoints directly. We cover the tracking algorithms in detail in the multi-object tracking lesson; for pose-specific use cases, ByteTrack on bounding boxes is the default.
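To make the idea of a stable track_id concrete, here is a deliberately simplified tracker that matches boxes between consecutive frames by greedy intersection-over-union. It is not ByteTrack or OC-SORT: it has no motion model, no low-confidence second pass, and no buffer for lost tracks. The class name and the 0.3 threshold are our assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

class GreedyIouTracker:
    """Assigns track ids by greedy IoU matching against the previous frame."""

    def __init__(self, iou_threshold=0.3):
        self.iou_threshold = iou_threshold
        self.tracks = {}      # track_id -> box from the previous frame
        self.next_id = 1

    def update(self, boxes):
        """Return one track_id per box, in input order."""
        pairs = sorted(
            ((iou(tbox, box), tid, i)
             for tid, tbox in self.tracks.items()
             for i, box in enumerate(boxes)),
            reverse=True,
        )
        assigned, used_tracks = {}, set()
        for score, tid, i in pairs:
            if score < self.iou_threshold:
                break                      # remaining pairs overlap too little
            if tid in used_tracks or i in assigned:
                continue
            assigned[i] = tid
            used_tracks.add(tid)
        for i in range(len(boxes)):        # unmatched boxes start new tracks
            if i not in assigned:
                assigned[i] = self.next_id
                self.next_id += 1
        # Tracks not seen in this frame are dropped (no lost-track buffer).
        self.tracks = {assigned[i]: boxes[i] for i in range(len(boxes))}
        return [assigned[i] for i in range(len(boxes))]

tracker = GreedyIouTracker()
print(tracker.update([(0, 0, 100, 200), (300, 0, 400, 200)]))
print(tracker.update([(305, 0, 405, 200), (5, 0, 105, 200)]))
```

In the second call the detections arrive in the opposite order, and the ids follow the people rather than the list positions.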

The fourth stage is smoothing. Raw per-frame keypoint estimates jitter. The same wrist, held still, will move three or four pixels frame-to-frame because the model is sensitive to noise in the input. Jitter destroys downstream features: a squat counter that wants to detect a hip-knee angle crossing 90° will fire and unfire dozens of times in a single rep if the hip and knee bounce. Production systems apply a temporal filter, typically a One-Euro filter or a Kalman filter, to each keypoint independently. The One-Euro filter is the default because it tracks fast motion with little lag while suppressing low-amplitude jitter; the official MediaPipe Pose pipeline applies it on-device as the last step.
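For reference, a compact One-Euro filter for a single scalar signal looks like the sketch below; in a pose pipeline you would keep one instance per coordinate per keypoint. It follows the algorithm from the Casiez et al. paper, but the parameter values (`min_cutoff=1.0`, `beta=0.007`) are illustrative assumptions that need tuning per application.

```python
import math

class OneEuroFilter:
    """Adaptive low-pass filter: heavy smoothing when slow, light when fast."""

    def __init__(self, min_cutoff=1.0, beta=0.007, d_cutoff=1.0):
        self.min_cutoff = min_cutoff   # Hz, cutoff at zero speed
        self.beta = beta               # how quickly the cutoff grows with speed
        self.d_cutoff = d_cutoff       # Hz, cutoff for the speed estimate
        self.x_prev = None
        self.dx_prev = 0.0
        self.t_prev = None

    @staticmethod
    def _alpha(cutoff, dt):
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau / dt)

    def __call__(self, t, x):
        """Filter sample x taken at time t (seconds)."""
        if self.x_prev is None:
            self.x_prev, self.t_prev = x, t
            return x
        dt = t - self.t_prev
        if dt <= 0:
            return self.x_prev
        # Smoothed estimate of the signal's speed.
        dx = (x - self.x_prev) / dt
        a_d = self._alpha(self.d_cutoff, dt)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Faster motion raises the cutoff, which reduces lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff, dt)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev, self.t_prev = x_hat, dx_hat, t
        return x_hat

# A wrist held still at x = 100 px, with +/- 2 px of jitter at 30 fps.
smooth = OneEuroFilter()
raw = [100.0 + (2.0 if i % 2 == 0 else -2.0) for i in range(90)]
filtered = [smooth(i / 30.0, x) for i, x in enumerate(raw)]
print("raw swing:     ", max(raw[-30:]) - min(raw[-30:]))
print("filtered swing:", round(max(filtered[-30:]) - min(filtered[-30:]), 3))
```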

Skipping any of the four stages is allowed when the use case allows it. A browser-side fitness app that tracks one person on a yoga mat does not need stage 1 (the detector) because MediaPipe Pose already includes a person detector in its first-stage network, and does not need stage 3 (the tracker) because there is only one person. A surveillance system that watches a corridor needs all four. Match the pipeline to the deployment.

Numeric example — the cost of running pose on a 1280 × 720 video at 30 frames per second

Here is the math for a single 1280 × 720 stream at 30 frames per second on a modern Intel server CPU.

Detection (RTMDet-tiny, 320 × 320 input): about 6 milliseconds per frame.
Pose (RTMPose-m, 256 × 192 input, one person): about 8 milliseconds per frame.
Tracking (ByteTrack on bounding boxes): about 0.3 milliseconds per frame.
Smoothing (One-Euro filter, 17 keypoints): about 0.05 milliseconds per frame.

Total per frame: 14.35 milliseconds.

Per second: 14.35 × 30 = 430.5 milliseconds of CPU time.

A single CPU core has 1000 milliseconds available per second, which means one stream uses about 43 percent of one core, leaving headroom for the rest of the application. Budgeting one core per stream to preserve that headroom, two streams need a second core and sixteen streams need a sixteen-core server. The cost on a 2026 cloud instance (e.g., AWS c7i.4xlarge at sixteen vCPUs, about $0.71/hour on demand) is about $0.045 per stream-hour ($0.71 ÷ 16). That number is the one that matters in the budget conversation: every other component of the pose feature (storage, downstream features, UI) is small compared to the model compute.

If you need more streams per server, the move is not to switch to a smaller model; it is to drop the pose frame rate. A squat counter does not need 30 frames per second of pose — 10 frames per second is enough, because the angular velocities involved are slow. Dropping from 30 to 10 fps cuts the cost by 3×.
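The budget arithmetic above fits in a short script, which makes it easy to rerun with your own stage timings. The per-stage milliseconds and the instance price are the figures quoted in this section; the one-stream-per-vCPU budgeting rule is the conservative assumption used there, and the function name is ours.

```python
# Per-frame stage costs in milliseconds, as quoted in the numeric example.
STAGE_MS = {"detection": 6.0, "pose": 8.0, "tracking": 0.3, "smoothing": 0.05}

INSTANCE_USD_PER_HOUR = 0.71   # quoted on-demand price for a 16-vCPU instance
INSTANCE_VCPUS = 16
STREAMS_PER_VCPU = 1           # conservative budgeting rule from the text

def core_fraction(pose_fps):
    """Fraction of one CPU core used by one stream at the given pose rate."""
    per_frame_ms = sum(STAGE_MS.values())
    return per_frame_ms * pose_fps / 1000.0

usd_per_stream_hour = INSTANCE_USD_PER_HOUR / (INSTANCE_VCPUS * STREAMS_PER_VCPU)

print(f"per frame:        {sum(STAGE_MS.values()):.2f} ms")
print(f"core use @30 fps: {core_fraction(30):.1%}")
print(f"core use @10 fps: {core_fraction(10):.1%}")
print(f"cost:             ${usd_per_stream_hour:.3f} per stream-hour")
```

Dropping the pose rate from 30 to 10 frames per second divides the per-stream core use by three, which is the lever described above.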

Where To Run It — Browser, Edge, Or Server

The deployment topology decision dominates everything else, because where the model runs determines the cost, the latency, the privacy posture, and the offline behaviour of the feature. The same model can be deployed in three different places, with three very different trade-offs.

Browser deployment

The pose model runs inside the user's browser tab, in WebAssembly with WebGPU acceleration. MediaPipe Pose is the only model in the four families that is shipped with first-party browser support, via the @mediapipe/tasks-vision package; the others can be run in the browser via ONNX Runtime Web, but with worse tooling and more friction. Cost to you: zero — the user pays the compute. Latency: lowest possible, because the frame never leaves the device. Privacy: the frames never leave the device, which removes most of the GDPR Article 9 surface area for fitness, yoga, and gesture-tracking apps. Trade-off: you are limited to whatever the user's hardware can do, you cannot use a model larger than about 30 megabytes (the practical limit for a page that has to load in a few seconds), and you cannot run multiple people on a budget device.

This is the default for fitness apps, yoga apps, gesture-controlled web games, and any browser-based telemedicine consultation that does not also do face recognition.

Edge deployment

The pose model runs on a small box near the camera: an NVIDIA Jetson Orin Nano, an Intel NUC, or a Raspberry Pi 5 with a Coral USB accelerator. RTMPose is the default model here because it hits 70+ FPS on a Snapdragon 865 and similar numbers on the Jetson Orin Nano's GPU. Cost to you: the cost of the edge box, amortized over its life (typically 3 to 5 years). Latency: low (single-digit milliseconds, plus the LAN round-trip if a server is involved in the response). Privacy: frames stay on the local network, which is the right posture for surveillance, retail analytics, and industrial settings where the customer does not want to send raw video to the cloud. Trade-off: you have to ship and manage hardware, which is operationally heavier than a cloud-only deployment.

This is the default for AI security cameras, intelligent video analytics, retail in-store analytics, and industrial computer-vision deployments.

Server deployment

The pose model runs on a GPU or a CPU in your cloud account, fed by streams from the cameras. RTMPose is the default for CPU; ViTPose is the default for GPU. Cost to you: the cloud bill, scaling linearly with the number of streams. Latency: dominated by the network round-trip (typically 30 to 100 ms) plus the model time. Privacy: frames flow to your cloud, which means you handle them as special-category biometric data under GDPR Article 9 (for biometric-identifying use cases like gait recognition) or as ordinary personal data (for use cases that do not identify individuals, like crowd counting). Trade-off: full operational flexibility, but you pay for every minute of every stream.

This is the default for sports analytics platforms, broadcast contribution, batch processing of recorded footage, and any feature that needs the largest models.

The choice between the three is rarely close. The use case picks the topology, and the topology picks the model.

Topology	Default model	Cost owner	Privacy posture	Latency budget
Browser	MediaPipe Pose	User's device	Frames stay on device	16 ms (60 FPS budget)
Edge	RTMPose-m / -s	One-time hardware	Frames stay on LAN	50 ms (round-trip + model)
Server	RTMPose-m (CPU) / ViTPose-L (GPU)	Cloud bill per stream	Frames travel to cloud	100 ms (RTT + model)

We treat that table as the first thing a project picks. Everything else — model variant, batch size, frame rate — is a refinement on top.

Figure 3. The same pose model can ship into three different deployment topologies. The topology picks the model, not the other way around.

Four Production Pitfalls That Sink Pose Features

We have shipped a lot of pose features into video products at Fora Soft. Four failure modes show up over and over again. They are not technical novelties — they are well-known traps that every production team rediscovers — but every team rediscovers them by going through the same painful debugging cycle. Here they are.

The first pitfall is shipping OpenPose into a commercial product without the Carnegie Mellon license. OpenPose is the model most teams learned pose on, because it is the one every tutorial uses, and the GitHub repository's README does not put the non-commercial clause at the top of the page. The license clause is in the third paragraph of the LICENSE file. We have rebuilt three pose features for clients who discovered the clause during a due-diligence audit eighteen months after launch. Default to MediaPipe Pose, RTMPose, or ViTPose for any commercial deployment. No accuracy difference justifies the legal exposure.

The second pitfall is forgetting to smooth. Raw per-frame keypoint output jitters at three to four pixels even when the subject is still. Downstream features that involve threshold crossings — a squat counter that watches the hip-knee angle, a fall detector that watches the head height, a "hand raised" trigger that watches the wrist position — fire spuriously dozens of times per second on raw output. The fix is one One-Euro filter per keypoint, applied as the last step of the pipeline. It is fifteen lines of code and it transforms every downstream metric.
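The spurious-firing problem is easy to reproduce. The sketch below computes a joint angle from three keypoints and counts downward crossings of a threshold; fed an angle series that hovers around 90°, the naive counter fires several times for what is a single rep. Function names and the sample series are hypothetical.

```python
import math

def joint_angle(a, b, c):
    """Angle at point b, in degrees, formed by the segments b-a and b-c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return float("nan")            # degenerate: two keypoints coincide
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def count_downward_crossings(angles, threshold=90.0):
    """Count transitions from above the threshold to at or below it."""
    return sum(1 for prev, cur in zip(angles, angles[1:])
               if prev > threshold >= cur)

# Knee angle from hip, knee, ankle pixel positions (image y grows downward).
print(joint_angle((0, 0), (0, 1), (1, 1)))    # right angle at the knee

# One squat rep, once with clean angles and once with jitter near 90 degrees.
clean = [100, 95, 90.5, 88, 85, 80, 95, 100]
jittery = [100, 95, 89, 92, 88, 91, 85, 80, 95, 100]
print("clean reps counted:  ", count_downward_crossings(clean))
print("jittery reps counted:", count_downward_crossings(jittery))
```

Smoothing the keypoints before computing the angle is what brings the second count back in line with the first.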

The third pitfall is running the model on every frame instead of every fourth frame. Most pose-driven features do not need 30 frames per second of pose. A squat counter, a fall detector, a posture-correction reminder, a hand-raise trigger — all of them work fine at 7 to 10 frames per second. Running the model on every fourth frame and interpolating the skeleton between frames cuts the compute by 75 percent with no perceptible drop in feature quality. Teams ship at 30 fps because that is the camera's native rate, and then watch their cloud bill scale linearly with no benefit.
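Running the model on every fourth frame and filling the gaps can be sketched as plain linear interpolation between consecutive model outputs. This is a minimal illustration with hypothetical function names; note that interpolating between two model frames means waiting for the later one, so this form adds up to `stride` frames of latency and suits recorded or latency-tolerant features best.

```python
def lerp_skeleton(kps_a, kps_b, t):
    """Linearly interpolate two skeletons (lists of (x, y)) at t in [0, 1]."""
    return [(ax + (bx - ax) * t, ay + (by - ay) * t)
            for (ax, ay), (bx, by) in zip(kps_a, kps_b)]

def fill_skipped_frames(model_outputs, stride=4):
    """Expand skeletons computed on every `stride`-th frame to every frame.

    model_outputs[k] is the skeleton for frame k * stride.
    """
    frames = []
    for a, b in zip(model_outputs, model_outputs[1:]):
        for step in range(stride):
            frames.append(lerp_skeleton(a, b, step / stride))
    if model_outputs:
        frames.append(model_outputs[-1])   # the final model frame itself
    return frames

# One keypoint, model run on frames 0 and 4 only.
outputs = [[(0.0, 0.0)], [(4.0, 8.0)]]
for index, skeleton in enumerate(fill_skipped_frames(outputs)):
    print(index, skeleton)
```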

The fourth pitfall is assuming three-dimensional keypoints from a single camera are accurate enough to do clinical-grade measurements. They are not. The three-dimensional coordinate that MediaPipe Pose, RTMPose-3D, or ViTPose-3D emits is a monocular depth estimate — the model is guessing how far each joint sticks out toward or away from the camera, based on patterns it learned in training. The estimate is accurate enough to grade a squat or a yoga pose, where the joint depth has to be roughly correct. It is not accurate enough to measure a shoulder abduction angle for a physiotherapy report. If you need clinical-grade three-dimensional measurements, you need a multi-camera rig (typically two to four synchronized cameras) and a triangulation pipeline. The single-camera monocular estimate is a good user-experience feature, not a measurement instrument.

Figure 4. The four pitfalls that sink pose-tracking deployments. None of them is technically new; all of them are still rediscovered by every team. (Diagram: the four pitfalls as four red callout cards arranged in a two-by-two grid.)

Picking The Right Model — A Decision Walkthrough

Here is the decision tree we use at Fora Soft when a project asks for a pose feature. It runs in five questions.

The first question is: is this a commercial product? If yes, do not use OpenPose. The non-commercial license forecloses it. Move to one of the other three families.

The second question is: where does the model run? If the answer is "in the browser, on the user's device, for a fitness or yoga or gesture-control feature", the answer is MediaPipe Pose. The first-party JavaScript bindings, the 33-keypoint output, and the WebGPU acceleration make it the only sensible choice for the browser case in 2026. If the answer is "on a small box near the camera, for surveillance or retail or industrial computer vision", the answer is RTMPose-m or RTMPose-s on the Jetson Orin Nano or equivalent. If the answer is "on a server in our cloud, batch or streaming", continue to question three.

The third question is: how many concurrent streams? If the answer is "more than ten per server", RTMPose-m on CPU is the right pick — 90+ frames per second per core means a 16-core server handles dozens of streams. If the answer is "fewer than ten per server, but every accuracy point matters", ViTPose-L on a single GPU is the right pick.

The fourth question is: do you need whole-body keypoints? If the downstream feature uses hand fingers (sign-language translation), dense facial landmarks (subtle emotion or attention features), or the full 133-point output, RTMW (RTMPose Whole-body) is the default. If the downstream feature uses only the COCO 17 or the BlazePose 33 (squat counting, fall detection, gesture triggers), do not pay for the bigger model.

The fifth question is: do you need real three-dimensional measurements? If yes, you need a multi-camera triangulation rig, not a single-camera model. If "approximately three-dimensional, for UX feedback" is enough, MediaPipe Pose's 3D mode or RTMW3D is fine.

That decision tree picks the model in about ninety seconds. The rest of the engineering — the detector pick, the tracker pick, the smoothing-filter pick — is downstream of those five answers.
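The five questions can be written down as a small function, which is a convenient way to keep the reasoning in a project's repository. This is one possible reading of the walkthrough, not a definitive rule set: the function and argument names are hypothetical, and the fallback for a low-stream server where accuracy is not critical (RTMPose-m on CPU) is our assumption, since the walkthrough does not cover that case explicitly.

```python
def pick_pose_model(topology, commercial=True, streams_per_server=1,
                    accuracy_critical=False, whole_body=False,
                    clinical_3d=False):
    """Return (model, notes) following the five-question walkthrough."""
    notes = []
    # Question 1: commercial products cannot use OpenPose without a license.
    if commercial:
        notes.append("OpenPose excluded: non-commercial license")
    # Questions 2 and 3: topology first, then stream count on servers.
    if topology == "browser":
        model = "MediaPipe Pose"
    elif topology == "edge":
        model = "RTMPose-m or RTMPose-s"
    elif topology == "server":
        if streams_per_server < 10 and accuracy_critical:
            model = "ViTPose-L (GPU)"
        else:
            model = "RTMPose-m (CPU)"   # assumed fallback for uncovered cases
    else:
        raise ValueError(f"unknown topology: {topology!r}")
    # Question 4: fingers or dense face landmarks need the whole-body model.
    if whole_body:
        model = "RTMW"
    # Question 5: measurement-grade 3D is a rig problem, not a model problem.
    if clinical_3d:
        notes.append("multi-camera triangulation rig required")
    return model, notes

print(pick_pose_model("browser"))
print(pick_pose_model("server", streams_per_server=4, accuracy_critical=True))
print(pick_pose_model("edge", whole_body=True, clinical_3d=True))
```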

Where Fora Soft Fits In

We have shipped pose features into half a dozen product categories at Fora Soft. The patterns we use are the ones in this article. In video conferencing, we use MediaPipe Pose for browser-side gesture triggers — hand raises, wave-to-acknowledge — that do not need to know who the person is. In e-learning, we use MediaPipe Pose for dance and yoga lessons, where the model runs entirely in the student's tab. In telemedicine, we use RTMPose on the server for range-of-motion measurements, with a multi-camera rig when the consultation needs clinical-grade three-dimensional data. In OTT and broadcast contribution, we use ViTPose for sports analytics — broadcasters who want to overlay pose data on top of athlete footage have the GPU budget and the accuracy requirement. In surveillance and industrial computer vision, we use RTMPose on Jetson edge boxes for fall detection, fight detection, and ergonomic-risk scoring. The model choice is always downstream of the deployment topology and the use-case; the architecture decisions described in this article are the ones we make at every project kickoff.

References

Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields. IEEE TPAMI, 2019. arXiv:1812.08008. The journal version of the OpenPose paper; the conference version appeared at CVPR 2017. The license file at github.com/CMU-Perceptual-Computing-Lab/openpose/blob/master/LICENSE is the authoritative source for the non-commercial clause.
Bazarevsky, V., Grishchenko, I., Raveendran, K., Zhu, T., Zhang, F., Grundmann, M. BlazePose: On-device Real-time Body Pose tracking. arXiv:2006.10204, 2020. The original BlazePose paper from Google Research.
Google AI Edge. MediaPipe Pose Landmarker — model card and Tasks API documentation. ai.google.dev/edge/mediapipe/solutions/vision/pose_landmarker, accessed 2026-05-24. The first-party documentation for the JavaScript Tasks API used in browser deployments.
Jiang, T., Lu, P., Zhang, L., et al. RTMPose: Real-Time Multi-Person Pose Estimation based on MMPose. arXiv:2303.07399, 2023. The RTMPose paper with the COCO and CPU benchmark numbers cited in this article.
OpenMMLab. MMPose 1.x — RTMPose project README and model zoo. github.com/open-mmlab/mmpose/tree/main/projects/rtmpose, accessed 2026-05-24. The maintained source of the ONNX, TensorRT, and TorchScript exports cited in the deployment section.
Jiang, T., Xie, X., Li, Y. RTMW: Real-Time Multi-Person 2D and 3D Whole-body Pose Estimation. arXiv:2407.08634, 2024. The RTMPose whole-body and 3D variants.
Xu, Y., Zhang, J., Zhang, Q., Tao, D. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. NeurIPS 2022. arXiv:2204.12484. The ViTPose paper with the 80.9 percent COCO test-dev number.
Xu, Y., Zhang, J., Zhang, Q., Tao, D. ViTPose++: Vision Transformer for Generic Body Pose Estimation. arXiv:2212.04246, 2024. The ViTPose++ extension covering multi-task pose.
Lin, T.-Y., Maire, M., Belongie, S., et al. Microsoft COCO: Common Objects in Context. ECCV 2014. The keypoint annotation extension defines the 17-keypoint standard the field uses as a benchmark.
Zhang, F., Bazarevsky, V., Vakunov, A., et al. MediaPipe Hands — On-device real-time hand tracking. arXiv:2006.10214, 2020. Background on the two-stage detect-then-track pattern that MediaPipe Pose inherits.
Casiez, G., Roussel, N., Vogel, D. 1€ Filter — A Simple Speed-based Low-pass Filter for Noisy Input in Interactive Systems. CHI 2012. The One-Euro filter used as the smoothing default in stage 4 of the pipeline.
Zhang, Y., Sun, P., Jiang, Y., et al. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. ECCV 2022. The default tracker for multi-person pose pipelines.