A B-frame is a video frame that is reconstructed from both a previous frame and a future frame: the encoder picks the regions of each that most resemble the current frame and stores only the small difference. The "B" stands for bidirectional (newer standards such as H.264 call it bi-predictive). Because a B-frame can borrow from two reference frames, it is usually much smaller than a P-frame, which predicts from earlier frames only. A rough rule of thumb is 25–30% of a P-frame's size, though the ratio varies with content and encoder settings. Compared with a full I-frame, the saving is larger still.
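To make the "borrow from both sides, store the difference" idea concrete, here is a minimal sketch in Python. It is a deliberate simplification: real codecs search for moving blocks and choose per block which reference to use, whereas this toy version just averages two co-located blocks. The function names and the sample brightness values are hypothetical, chosen only for illustration.

```python
def predict_bidirectional(prev_block, next_block):
    """Predict each sample as the rounded average of the two reference blocks."""
    return [(p + n + 1) // 2 for p, n in zip(prev_block, next_block)]


def residual(actual, prediction):
    """The small fixup the encoder would store: actual minus prediction."""
    return [a - p for a, p in zip(actual, prediction)]


def reconstruct(prediction, res):
    """What the decoder does: prediction plus the stored residual."""
    return [p + r for p, r in zip(prediction, res)]


def cost(res):
    """Crude stand-in for storage cost: sum of absolute residual values."""
    return sum(abs(r) for r in res)


# Hypothetical 1-D blocks of pixel brightness during a slow fade (assumed data).
prev_block = [100, 102, 104, 106]   # from the earlier reference frame
next_block = [120, 122, 124, 126]   # from the later reference frame
current = [110, 112, 115, 116]      # the frame being encoded

b_pred = predict_bidirectional(prev_block, next_block)
b_res = residual(current, b_pred)       # B-frame style: uses both references
p_res = residual(current, prev_block)   # P-frame style: uses the past only

print("B residual:", b_res, "cost", cost(b_res))
print("P residual:", p_res, "cost", cost(p_res))
print("lossless round trip:", reconstruct(b_pred, b_res) == current)
```

In this made-up fade the two-sided prediction lands almost exactly on the real frame, so the residual is nearly all zeros, while the backward-only prediction leaves a noticeably larger difference to store.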
B-frames are a large part of why a typical movie file is small. A compressed video might use one I-frame every 2 seconds, P-frames that predict each change from the frames before them, and B-frames in between to fill the gaps cheaply. The encoder is essentially saying: "to draw this frame, take the man's silhouette from the previous frame, blend it with the new background from the next frame, and apply this small fixup". For most natural-motion content, this is dramatically more efficient than describing each frame from scratch.
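The frame layout described above can be sketched as a short Python script that builds one group of pictures and adds up a rough size estimate. Everything numeric here is an assumption for illustration, not a measurement: 24 frames per second, two B-frames between reference frames, a P-frame at 40% of an I-frame, and a B-frame at 30% of a P-frame (the upper end of the rule of thumb mentioned earlier). The function names are hypothetical too.

```python
from collections import Counter


def gop_pattern(gop_length, b_between):
    """Display-order frame types: one I-frame, then runs of B-frames
    each closed by a P-frame, so every B has a later reference."""
    types = ["I"]
    while len(types) < gop_length:
        remaining = gop_length - len(types)
        n_b = min(b_between, remaining - 1)  # always leave room for the closing P
        types.extend(["B"] * n_b)
        types.append("P")
    return types


# Assumed relative sizes, with an I-frame as the unit (placeholders, not data).
RELATIVE_SIZE = {"I": 1.0, "P": 0.4, "B": 0.4 * 0.3}


def relative_total(types):
    """Total size of the group, measured in I-frame equivalents."""
    return sum(RELATIVE_SIZE[t] for t in types)


fps = 24                      # assumed frame rate
gop = gop_pattern(2 * fps, 2)  # one I-frame every 2 seconds
counts = Counter(gop)

print("".join(gop))
print("frame counts:", dict(counts))
print("with B-frames:   %.2f I-frame equivalents" % relative_total(gop))
print("I and P only:    %.2f" % relative_total(["I"] + ["P"] * (len(gop) - 1)))
print("all I-frames:    %.2f" % relative_total(["I"] * len(gop)))
```

Changing the placeholder ratios shifts the totals, but the ordering stays the same: all I-frames is the most expensive, I plus P is cheaper, and mixing in B-frames is cheaper again.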
The trade-off is latency. A player cannot decode a B-frame until it has received the future reference frame, so the encoder must wait for that later frame to be captured, encode and send it first, and only then emit the B-frames that depend on it. Frames therefore travel in a different order from the one in which they are displayed, and every consecutive B-frame adds roughly one frame interval of delay. That's fine for video-on-demand (you encode once, so latency doesn't matter), and acceptable for most live streaming (a few hundred milliseconds is rarely noticed). But it's why ultra-low-latency systems like WebRTC video calls, cloud gaming and sports betting streams either disable B-frames entirely or use only a few: every B-frame in flight is extra glass-to-glass delay.
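The reordering behind that delay can be shown in a few lines of Python. This is a sketch under simple assumptions: each B-frame depends only on the nearest reference frame on either side, and the delay figure counts only the wait for the future reference to be captured, ignoring encoder lookahead, network and buffering. The function names are hypothetical.

```python
def decode_order(display_types):
    """Turn display order into transmission/decode order.

    Each reference frame (I or P) is sent before the B-frames that
    precede it on screen, because those B-frames need it to decode.
    """
    out = []
    waiting_b = []
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            waiting_b.append((index, frame_type))  # held until the next reference
        else:
            out.append((index, frame_type))        # the reference goes first
            out.extend(waiting_b)                  # then the B-frames it unlocks
            waiting_b = []
    out.extend(waiting_b)  # trailing B-frames would wait on the next group
    return out


def reorder_delay_ms(consecutive_b, fps):
    """Minimum extra delay: the first B-frame in a run cannot be emitted
    until the whole run plus the following reference has been captured."""
    return consecutive_b * 1000.0 / fps


display = ["I", "B", "B", "P", "B", "B", "P"]
order = decode_order(display)

print("display:", " ".join(f"{t}{i}" for i, t in enumerate(display)))
print("decode: ", " ".join(f"{t}{i}" for i, t in order))
for n_b in (0, 1, 2, 3):
    print(f"{n_b} B-frames at 30 fps: about {reorder_delay_ms(n_b, 30):.1f} ms added")
```

With no B-frames the two orders are identical and the added delay is zero, which is the configuration that low-latency systems tend to choose.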

