SAM produces a precise pixel mask for whatever you indicate with a click or box, without per-class training. SAM 2 adds a memory so a mask follows the object across video frames, which is the basis for rotoscoping and region-of-interest pipelines.