Robust Failure Diagnosis of Microservice System through Multimodal Data
About
Automatic failure diagnosis is crucial for large microservice systems. Currently, most failure diagnosis methods rely on single-modal data (i.e., only metrics, only logs, or only traces). We conduct an empirical study on real-world failure cases and show that combining these data sources (multimodal data) leads to more accurate diagnosis. However, effectively representing these data and addressing imbalanced failures remain challenging. To tackle these issues, we propose DiagFusion, a robust failure diagnosis approach that uses multimodal data. It leverages embedding techniques and data augmentation to represent the multimodal data of service instances, combines deployment data and traces to build a dependency graph, and uses a graph neural network to localize the root cause instance and determine the failure type. Our evaluations on real-world datasets show that DiagFusion outperforms existing methods in root cause instance localization (improving by 20.9% to 368%) and failure type determination (improving by 11.0% to 169%).
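
The dependency graph step can be illustrated with a minimal sketch. It assumes, purely for illustration, that traces have been reduced to (caller, callee) instance pairs and that deployment data is a mapping from instance to host; the function name `build_dependency_graph`, the edge labels, and the sample instance names are hypothetical and not taken from DiagFusion itself.

```python
from collections import defaultdict
from itertools import combinations


def build_dependency_graph(spans, deployment):
    """Combine trace and deployment data into one edge set.

    spans: iterable of (caller_instance, callee_instance) pairs (assumed format).
    deployment: dict mapping instance name -> host name (assumed format).
    Returns (sorted node list, sorted list of (src, dst, kind) edges).
    """
    edges = set()

    # Call edges: one directed edge per distinct caller/callee pair in the traces.
    for caller, callee in spans:
        if caller != callee:
            edges.add((caller, callee, "call"))

    # Co-location edges: instances on the same host are linked in both directions.
    by_host = defaultdict(list)
    for instance, host in deployment.items():
        by_host[host].append(instance)
    for instances in by_host.values():
        for a, b in combinations(sorted(instances), 2):
            edges.add((a, b, "colocated"))
            edges.add((b, a, "colocated"))

    nodes = set(deployment) | {n for src, dst, _ in edges for n in (src, dst)}
    return sorted(nodes), sorted(edges)


# Hypothetical sample data.
spans = [("frontend-0", "cart-0"), ("cart-0", "db-0"), ("frontend-0", "cart-0")]
deployment = {"frontend-0": "node-a", "cart-0": "node-a", "db-0": "node-b"}
nodes, edges = build_dependency_graph(spans, deployment)
```

In a pipeline of this kind, the resulting nodes would carry the per-instance embeddings and the edge list would be passed to the graph neural network; how DiagFusion encodes edge types is not specified here.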
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Root Cause Localization | D2 complete data conditions | Top-1 Accuracy | 58.2 | 7 |
| Root Cause Localization | D1 complete data conditions | Top-1 Score | 31 | 7 |
| Failure Triage | D1 complete data conditions | Precision | 67.5 | 6 |
| Failure Triage | D2 complete data conditions | Precision | 79.7 | 6 |
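
For readers unfamiliar with the localization metric, the following is a minimal sketch of how a top-k accuracy could be computed, assuming each failure case yields a ranked list of candidate instances and has exactly one true root cause instance. The function name and the sample data are hypothetical, and the exact metric definition behind the table is not given here.

```python
def top_k_accuracy(ranked_lists, true_roots, k=1):
    """Fraction of failure cases whose true root cause is among the top-k candidates."""
    if len(ranked_lists) != len(true_roots) or not true_roots:
        raise ValueError("need one non-empty ranking per labelled failure case")
    hits = sum(
        1 for ranked, truth in zip(ranked_lists, true_roots) if truth in ranked[:k]
    )
    return hits / len(true_roots)


# Hypothetical rankings for three failure cases.
ranked = [["a", "b"], ["b", "a"], ["c", "a"]]
truth = ["a", "a", "c"]
top1 = top_k_accuracy(ranked, truth, k=1)
top2 = top_k_accuracy(ranked, truth, k=2)
```

With this sample data, two of the three cases rank the true instance first, and all three place it within the top two.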