Stitch-a-Demo: Video Demonstrations from Multistep Descriptions
About
When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual context. However, prior work does not permit visual illustration of multistep descriptions, e.g., a cooking recipe or a gardening instruction manual, and simply handling each step description in isolation would result in an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method to assemble a video demonstration from a multistep description. The resulting video contains clips, possibly from different sources, that accurately reflect all the step descriptions while being visually coherent. We formulate a training pipeline that creates large-scale weakly supervised data containing diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains up to 29% as well as dramatic wins in a human preference study.
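To make the contrast between per-step retrieval and coherent assembly concrete, here is a minimal sketch of one way to pick a clip for each step jointly. It is not the method described in the abstract: it assumes precomputed step-to-clip similarity scores and clip-to-clip coherence scores, and the function name `stitch_demo` and the `coherence_weight` parameter are hypothetical. It uses a standard Viterbi-style dynamic program to maximize the sum of step matches plus weighted coherence between consecutive clips.

```python
import numpy as np

def stitch_demo(step_clip_sim, clip_clip_coh, coherence_weight=0.5):
    """Pick one clip per step, trading off step match against coherence.

    step_clip_sim: (num_steps, num_clips) text-to-clip similarity (assumed given).
    clip_clip_coh: (num_clips, num_clips) coherence between clip pairs (assumed given).
    Returns a list with one clip index per step.
    """
    step_clip_sim = np.asarray(step_clip_sim, dtype=float)
    clip_clip_coh = np.asarray(clip_clip_coh, dtype=float)
    num_steps, num_clips = step_clip_sim.shape

    # best[j]: score of the best sequence so far that ends with clip j.
    best = step_clip_sim[0].copy()
    back = np.zeros((num_steps, num_clips), dtype=int)

    for t in range(1, num_steps):
        # cand[i, j]: best sequence ending in clip i, followed by clip j.
        cand = best[:, None] + coherence_weight * clip_clip_coh
        back[t] = cand.argmax(axis=0)
        best = cand.max(axis=0) + step_clip_sim[t]

    # Trace the best sequence backwards from the final step.
    path = [int(best.argmax())]
    for t in range(num_steps - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

Setting `coherence_weight=0` reduces this to handling each step in isolation, which is the baseline behavior the abstract argues against.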
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video demonstration retrieval | SaD-MC Cooking (test) | Mean Rank (MR) | 3 | 6 |
| Video demonstration retrieval | SaD-VD Cooking (test) | Mean Rank (MR) | 3.5 | 6 |
| Video demonstration retrieval | HT-Step Cooking (test) | Mean Rank (MR) | 40 | 6 |
| Video demonstration retrieval | COIN CT (Cooking) (test) | Mean Rank (MR) | 6 | 6 |
| Video demonstration retrieval | SaD-MC Woodworking (test) | Mean Rank (MR) | 24 | 6 |
| Video demonstration retrieval | COIN CT Woodworking (test) | Mean Rank (MR) | 30 | 6 |
| Video demonstration retrieval | SaD-MC Gardening (test) | Mean Rank (MR) | 26 | 6 |
| Video demonstration retrieval | COIN CT (Gardening) (test) | Mean Rank (MR) | 25 | 6 |
| Human Preference Evaluation | Cooking | Step Faithfulness Win Rate | 94 | 3 |
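
The retrieval rows above report mean rank (MR), where lower is better. As a rough illustration of the metric, the sketch below computes mean rank from a score matrix. The exact candidate pools and tie handling used by these benchmarks are not stated here, so this assumes each query has one ground-truth candidate and that ties are resolved in favor of the ground truth; the function name `mean_rank` is hypothetical.

```python
import numpy as np

def mean_rank(scores, gt_index):
    """Mean rank of the ground-truth candidate across queries.

    scores: (num_queries, num_candidates) similarity scores, higher is better.
    gt_index: (num_queries,) index of the correct candidate for each query.
    """
    scores = np.asarray(scores, dtype=float)
    gt_index = np.asarray(gt_index)
    gt_scores = scores[np.arange(len(scores)), gt_index]
    # Rank is 1 plus the number of candidates scored strictly above the ground truth.
    ranks = 1 + (scores > gt_scores[:, None]).sum(axis=1)
    return float(ranks.mean())
```

A perfect retriever scores 1.0 under this definition, since the correct candidate is always ranked first.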