FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos

About

We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel-level segmentation masks for all prominent objects in videos. We formulate this task as a structured prediction problem and design a two-stream fully convolutional neural network which fuses together motion and appearance in a unified framework. Since large-scale video datasets with pixel-level segmentations are difficult to obtain, we show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. Through experiments on three challenging video segmentation benchmarks, our method substantially improves the state of the art for segmenting generic (unseen) objects. Code and pre-trained models are available on the project website.

Suyog Dutt Jain, Bo Xiong, Kristen Grauman • 2017

Related benchmarks

Task	Dataset	Result
Video Object Segmentation	DAVIS 2016 (val)	J Mean 70.7	564
Video Object Segmentation	DAVIS	--	128
Unsupervised Video Object Segmentation	DAVIS 2016 (val)	F Mean 65.3	108
Unsupervised Video Object Segmentation	FBMS (test)	J Mean 68.4	66
Unsupervised Video Object Segmentation	DAVIS 2016 (test)	J Mean 70.7	50
Video Object Segmentation	DAVIS 2016	J-Measure 70.7	50
Video Object Segmentation	YouTube-Objects	mIoU 68.4	50
Video Object Segmentation	FBMS (test)	J-measure 68.4	42
Video Object Segmentation	SegTrack v2 (test)	J Mean 61.4	40
Video Object Segmentation	YoutubeObjects (val)	mIoU 68.4	35

Showing 10 of 27 rows
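
Most rows in the table report a J (Jaccard) or mIoU score, which measures the overlap between a predicted mask and the ground-truth mask as intersection over union. The sketch below shows one way such a score could be computed for binary masks stored as nested lists. The function names `jaccard` and `j_mean` are hypothetical, and the plain averaging over frames is an assumption: each benchmark defines its own aggregation over frames, sequences, and classes, so this is not the official evaluation code of any dataset listed here.

```python
def jaccard(pred, gt):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    if len(pred) != len(gt) or any(len(p) != len(g) for p, g in zip(pred, gt)):
        raise ValueError("masks must have the same shape")
    inter = 0
    union = 0
    for p_row, g_row in zip(pred, gt):
        for p, g in zip(p_row, g_row):
            if p and g:
                inter += 1  # pixel is foreground in both masks
            if p or g:
                union += 1  # pixel is foreground in at least one mask
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_mean(pred_frames, gt_frames):
    """Average the per-frame Jaccard score (assumed aggregation, see above)."""
    if not pred_frames or len(pred_frames) != len(gt_frames):
        raise ValueError("need the same non-zero number of frames")
    scores = [jaccard(p, g) for p, g in zip(pred_frames, gt_frames)]
    return sum(scores) / len(scores)


# Two toy 2x2 frames: the first overlaps on 1 of 2 foreground pixels,
# the second matches the ground truth exactly.
pred_frames = [[[1, 1], [0, 0]], [[1, 0], [0, 1]]]
gt_frames = [[[1, 0], [0, 0]], [[1, 0], [0, 1]]]
print(j_mean(pred_frames, gt_frames))  # 0.75
```

The table values appear to be on a 0 to 100 scale, so a score of 0.75 from this sketch would correspond to 75.0 there.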
