Robust Failure Diagnosis of Microservice System through Multimodal Data

About

Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely on single-modal data (i.e., only one of metrics, logs, or traces). We conduct an empirical study on real-world failure cases and show that combining these data sources (multimodal data) leads to more accurate diagnosis. However, effectively representing these data and addressing imbalanced failures remain challenging. To tackle these issues, we propose DiagFusion, a robust failure diagnosis approach that uses multimodal data. It leverages embedding techniques and data augmentation to represent the multimodal data of service instances, combines deployment data and traces to build a dependency graph, and uses a graph neural network to localize the root cause instance and determine the failure type. Our evaluations on real-world datasets show that DiagFusion outperforms existing methods in root cause instance localization (improving by 20.9% to 368%) and failure type determination (improving by 11.0% to 169%).
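
The abstract describes building a dependency graph from deployment data and traces and then applying a graph neural network over it. The following is a minimal sketch of that idea, not DiagFusion's actual implementation. The function names (build_dependency_graph, mean_aggregate), the instance and host names, and the feature values are all hypothetical. It assumes an undirected graph and replaces the learned GNN layer with plain neighbour averaging, which is only the aggregation step of message passing.

```python
from collections import defaultdict


def build_dependency_graph(trace_calls, deployment):
    """Return an undirected adjacency map over service instances.

    trace_calls: iterable of (caller, callee) instance pairs seen in traces.
    deployment:  dict mapping each instance to the host it runs on.
    Both input shapes are assumptions made for this sketch.
    """
    adj = defaultdict(set)
    for inst in deployment:
        adj[inst]  # touch the key so isolated instances still become nodes
    # Edges from traces: a call links caller and callee.
    for caller, callee in trace_calls:
        if caller != callee:
            adj[caller].add(callee)
            adj[callee].add(caller)
    # Edges from deployment data: instances sharing a host are linked.
    by_host = defaultdict(list)
    for inst, host in deployment.items():
        by_host[host].append(inst)
    for insts in by_host.values():
        for a in insts:
            for b in insts:
                if a != b:
                    adj[a].add(b)
    return {node: sorted(neigh) for node, neigh in adj.items()}


def mean_aggregate(adj, features):
    """One round of neighbour averaging, with each node included in its own mean.

    A real GNN layer would also apply learned weights and a nonlinearity.
    """
    out = {}
    for node, neigh in adj.items():
        group = [features[node]] + [features[n] for n in neigh]
        dim = len(features[node])
        out[node] = [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
    return out


# Hypothetical toy system: three instances on a call chain plus a cache.
trace_calls = [("frontend-1", "cart-1"), ("cart-1", "db-1")]
deployment = {
    "frontend-1": "host-a",
    "cart-1": "host-a",
    "db-1": "host-b",
    "cache-1": "host-b",
}
# Stand-ins for per-instance embeddings of metrics, logs, and traces.
features = {
    "frontend-1": [0.0, 1.0],
    "cart-1": [2.0, 1.0],
    "db-1": [4.0, 1.0],
    "cache-1": [0.0, 1.0],
}

graph = build_dependency_graph(trace_calls, deployment)
smoothed = mean_aggregate(graph, features)
```

In this toy graph, cache-1 never appears in a trace but is still connected to db-1 because the two share a host, which is the kind of relationship deployment data adds on top of call edges.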

Shenglin Zhang, Pengxiang Jin, Zihan Lin, Yongqian Sun, Bicheng Zhang, Sibo Xia, Zhengdan Li, Zhenyu Zhong, Minghua Ma, Wa Jin, Dai Zhang, Zhenyu Zhu, Dan Pei • 2023

Related benchmarks

Task	Dataset	Result
Root Cause Localization	D2 complete data conditions	Top-1 Accuracy 58.2	7
Root Cause Localization	D1 complete data conditions	Top-1 Score 31	7
Failure Triage	D1 complete data conditions	Precision 67.5	6
Failure Triage	D2 complete data conditions	Precision 79.7	6

