BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection

About

Detecting failures and identifying their root causes promptly and accurately is crucial for ensuring the availability of microservice systems. A typical failure troubleshooting pipeline for microservices consists of two phases: anomaly detection and root cause analysis. While various existing works on root cause analysis require accurate anomaly detection, there is no guarantee of accurate estimation with anomaly detection techniques. Inaccurate anomaly detection results can significantly affect the root cause localization results. To address this challenge, we propose BARO, an end-to-end approach that integrates anomaly detection and root cause analysis for effectively troubleshooting failures in microservice systems. BARO leverages the Multivariate Bayesian Online Change Point Detection technique to model the dependency within multivariate time-series metrics data, enabling it to detect anomalies more accurately. BARO also incorporates a novel nonparametric statistical hypothesis testing technique for robustly identifying root causes, which is less sensitive to the accuracy of anomaly detection compared to existing works. Our comprehensive experiments conducted on three popular benchmark microservice systems demonstrate that BARO consistently outperforms state-of-the-art approaches in both anomaly detection and root cause analysis.
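The root-cause ranking step described above can be illustrated with a minimal sketch. It is not BARO's actual implementation: it assumes the change point has already been detected, and it swaps in a simple median/IQR deviation score as one plausible nonparametric way to compare metrics before and after that point. The function names, metric names, and sample values are hypothetical.

```python
from statistics import median, quantiles


def robust_scores(normal, anomalous):
    """Score each metric by how far its post-change values drift from the
    pre-change median, in units of the pre-change interquartile range.

    `normal` and `anomalous` map metric name -> list of samples taken
    before and after the (assumed already detected) change point.
    """
    scores = {}
    for name, baseline in normal.items():
        med = median(baseline)
        q1, _, q3 = quantiles(baseline, n=4)
        # Guard against a constant baseline; falling back to 1.0 is an
        # illustrative assumption, not something taken from the paper.
        iqr = (q3 - q1) or 1.0
        scores[name] = max(abs(x - med) / iqr for x in anomalous[name])
    return scores


def rank_root_causes(normal, anomalous):
    """Return metric names ordered from most to least suspicious."""
    scores = robust_scores(normal, anomalous)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical metrics: CPU of one service jumps, latency of another barely moves.
normal = {
    "cart_cpu": [10, 11, 9, 10, 12, 10, 11, 9],
    "checkout_latency": [100, 102, 98, 101, 99, 100, 103, 97],
}
anomalous = {
    "cart_cpu": [30, 35, 40],
    "checkout_latency": [104, 106, 105],
}
ranking = rank_root_causes(normal, anomalous)
print(ranking)
```

Because the median and interquartile range change little when a few anomalous samples leak into the baseline window, a score of this kind tolerates a somewhat misplaced change point, which is the kind of robustness the abstract refers to.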

Luan Pham, Huong Ha, Hongyu Zhang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Root Cause Analysis | RCAEval Overall All nine datasets (RE1OB-RE3TT) 1.0 | Top-1 Accuracy | 19 | 9 |
| Root Cause Analysis | RE3TT Train Ticket with code-level faults | F1@1 | 0.14 | 9 |
| Root Cause Analysis | RE2TT (Train Ticket with multimodal data) | CPU Top-1 | 0.2 | 9 |
| Root Cause Analysis | RE3OB Online Boutique with code-level faults | F1 Top-1 Accuracy | 0.00e+0 | 9 |
| Root Cause Analysis | RE1TT Train Ticket unimodal data | CPU Top-1 | 0.12 | 8 |
| Root Cause Analysis | RE1OB (Online Boutique) RCAEval benchmark unimodal data | CPU Top-1 Acc | 12 | 8 |
| Root Cause Analysis | RE1SS (Sock Shop) unimodal data | CPU Top-1 | 40 | 8 |
| Root Cause Analysis | RE3SS Sock Shop with code-level faults | F1 Top-1 | 0.2 | 8 |
| Root Cause Analysis | RE2SS Sock Shop with multimodal data (test) | CPU Top-1 Accuracy | 0.00e+0 | 8 |
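The Top-1 figures above are ranking metrics. As a rough illustration, and assuming the usual definition (the fraction of failure cases whose true root cause appears among the first k ranked candidates), they could be computed as follows. The benchmark's exact definition may differ, and the service names and rankings here are made up for the example.

```python
def top_k_accuracy(rankings, truths, k=1):
    """Fraction of cases whose true root cause is within the top-k candidates.

    `rankings` is a list of ranked candidate lists (most suspicious first),
    and `truths` holds the ground-truth root cause for each case.
    """
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, truths))
    return hits / len(truths)


# Three hypothetical failure cases with their ranked candidates.
rankings = [
    ["cart", "checkout", "payment"],
    ["payment", "cart", "checkout"],
    ["checkout", "cart", "payment"],
]
truths = ["cart", "cart", "payment"]

top1 = top_k_accuracy(rankings, truths, k=1)
top2 = top_k_accuracy(rankings, truths, k=2)
print(top1, top2)
```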
