ChaosRank takes your Jaeger trace export and incident history CSV and produces a ranked list of services to target, with a suggested fault type and confidence level for each.
The risk score combines two signals:
- Blast radius: blended PageRank + in-degree centrality on the dependency graph (captures both deep chains and shallow-wide hubs)
- Fragility: per-incident traffic-normalized severity with exponential decay (normalization order matters: post-hoc normalization produces ranking inversions at high traffic differentials)
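A rough sketch of the two signals, for illustration only and not the actual ChaosRank code. Everything named here is an assumption: the function names, the caller -> callee edge direction, the 50/50 blend weight, the 30-day half-life, the max-normalization, and the reading of "post-hoc normalization" as dividing summed severity by mean traffic. The toy graph and incident tuples are made up.

```python
import math

def pagerank(nodes, edges, damping=0.85, iters=100):
    """Power-iteration PageRank; edges are (caller, callee) pairs (assumed direction)."""
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Rank held by services with no outgoing calls is spread uniformly.
        dangling = sum(rank[n] for n in nodes if not out[n])
        new = {n: (1 - damping + damping * dangling) / len(nodes) for n in nodes}
        for n in nodes:
            for dst in out[n]:
                new[dst] += damping * rank[n] / len(out[n])
        rank = new
    return rank

def blast_radius(nodes, edges, alpha=0.5):
    """Blend max-normalized PageRank with max-normalized in-degree (alpha is hypothetical)."""
    pr = pagerank(nodes, edges)
    indeg = {n: 0 for n in nodes}
    for _, dst in edges:
        indeg[dst] += 1
    max_pr = max(pr.values())
    max_in = max(indeg.values()) or 1  # avoid dividing by zero on an edgeless graph
    return {n: alpha * pr[n] / max_pr + (1 - alpha) * indeg[n] / max_in for n in nodes}

def fragility_per_incident(incidents, half_life_days=30.0):
    """Normalize each incident by the traffic it occurred under, then decay and sum."""
    lam = math.log(2) / half_life_days
    return sum(sev / traffic * math.exp(-lam * age) for sev, traffic, age in incidents)

def fragility_post_hoc(incidents, half_life_days=30.0):
    """Sum decayed severity first, then divide once by mean traffic."""
    lam = math.log(2) / half_life_days
    decayed = sum(sev * math.exp(-lam * age) for sev, _, age in incidents)
    mean_traffic = sum(traffic for _, traffic, _ in incidents) / len(incidents)
    return decayed / mean_traffic

# Toy dependency graph: gateway calls feed and auth, feed calls auth and db, auth calls db.
nodes = ["gateway", "feed", "auth", "db"]
edges = [("gateway", "feed"), ("gateway", "auth"),
         ("feed", "auth"), ("feed", "db"), ("auth", "db")]
br = blast_radius(nodes, edges)
ranked = sorted(nodes, key=br.get, reverse=True)
print("blast radius order:", ranked)

# Incidents as (severity, requests during the incident, age in days).
svc_a = [(5, 100, 10), (1, 10_000, 10)]  # one severe incident during a quiet period
svc_b = [(3, 200, 10), (3, 200, 10)]     # two moderate incidents at steady traffic
print("per-incident:", fragility_per_incident(svc_a), fragility_per_incident(svc_b))
print("post-hoc:    ", fragility_post_hoc(svc_a), fragility_post_hoc(svc_b))
```

With these made-up numbers, per-incident normalization ranks svc_a as more fragile than svc_b, while the post-hoc variant flips the order because svc_a's one high-traffic incident inflates its mean traffic. That is the kind of inversion the fragility note refers to.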
Evaluated on the DeathStarBench social-network topology (31 services) from the UIUC/FIRM dataset (OSDI 2020). Across 20 trials, it found the seeded weaknesses in 1 experiment on average, vs 9.8 experiments for random selection.
Output formats: Rich terminal table, JSON, and LitmusChaos ChaosEngine YAML (pipeable directly to kubectl apply).
Sample data is included, so you can try it without your own traces:
pip install chaosrank-cli
git clone https://github.com/Medinz01/chaosrank
cd chaosrank
chaosrank rank \
--traces benchmarks/real_traces/social_network.json \
--incidents benchmarks/real_traces/social_network_incidents.csv
Known limitations: async dependencies (Kafka, SQS) don't appear in trace spans, so blast radius is underestimated for event-driven architectures. Jaeger JSON only for now; OTel OTLP is next.

Happy to discuss the algorithm design, particularly the PageRank direction choice and why per-incident normalization matters.