FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

About

Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval. However, existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors. First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness. Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations. To address these challenges, we propose a novel dataset, which involves re-annotating the captions in Visual Genome (VG) using a new intermediate representation called FACTUAL-MR. FACTUAL-MR can be directly converted into faithful and consistent scene graph annotations. Our experimental results clearly demonstrate that the parser trained on our dataset outperforms existing approaches in terms of faithfulness and consistency. This improvement leads to a significant performance boost in both image caption evaluation and zero-shot image retrieval tasks. Furthermore, we introduce a novel metric for measuring scene graph similarity, which, when combined with the improved scene graph parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets for the aforementioned tasks. The code and dataset are available at https://github.com/zhuang-li/FACTUAL.
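To make the inconsistency problem concrete, the sketch below represents a scene graph as a set of (subject, relation, object) triples and compares two annotations of the same caption. It is only an illustration under stated assumptions: the triple layout, the `has_attribute` relation name, and the helper functions `set_match` and `triple_f1` are hypothetical, and they are neither the FACTUAL-MR format nor the similarity metric introduced in the paper.

```python
from typing import Iterable, Tuple

# Hypothetical representation: one triple per fact in the caption.
Triple = Tuple[str, str, str]  # (subject, relation, object)


def set_match(pred: Iterable[Triple], gold: Iterable[Triple]) -> bool:
    """Exact match: both graphs contain the same triples, in any order."""
    return frozenset(pred) == frozenset(gold)


def triple_f1(pred: Iterable[Triple], gold: Iterable[Triple]) -> float:
    """F1 over exactly matching triples (a simple stand-in similarity)."""
    pred_set, gold_set = set(pred), set(gold)
    if not pred_set or not gold_set:
        # Two empty graphs are identical; one empty graph shares nothing.
        return float(pred_set == gold_set)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Two annotations of "a man riding a brown horse" (invented example).
gold = [("man", "ride", "horse"), ("horse", "has_attribute", "brown")]
variant = [("man", "riding", "horse"), ("horse", "has_attribute", "brown")]

if __name__ == "__main__":
    print(set_match(gold, variant))   # the two annotations do not match exactly
    print(triple_f1(variant, gold))   # only one of the two triples overlaps
```

The two graphs describe the same scene, yet the differing relation label ("ride" versus "riding") is enough to break an exact comparison. This is the kind of annotation inconsistency that makes exact-match evaluation unreliable unless annotations are normalized.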

Zhuang Li, Yuyang Chai, Terry Yue Zhuo, Lizhen Qu, Gholamreza Haffari, Fei Li, Donghong Ji, Quan Hung Tran • 2023

Related benchmarks

Task	Dataset	Metric	Value
Image Captioning Evaluation	Flickr8K Expert (test)	Kendall tau_c	0.5335	76
Video Captioning Evaluation Correlation	VATEX Eval	Kendall's Tau-b	36.31	40
Image Caption Evaluation	FOIL (4-ref)	Accuracy	94.64	15
Image Caption Evaluation	FOIL 1-ref	Accuracy	90.69	15
Image Retrieval	Random (test)	Recall@1	79.39	10
Image Retrieval	Length (test)	Recall@1	75	10
Image Caption Evaluation	Flickr8k (test)	tau_c	57.37	7
Scene Graph Parsing	Random (test)	Set Match	79.77	6
Scene Graph Parsing	Length (test)	Set Match	4.24e+3	6
Scene Graph Parsing	FACTUAL (test)	Completeness	0.92	5
