A Datacenter-Scale Simulation Framework for Energy- and SLO-Aware LLM Inference Serving
CloudSimLLM is a non-invasive extension of CloudSim Plus that adds first-class large-language-model (LLM) serving primitives — prefill/decode phases, PagedAttention KV cache, continuous batching, phase-aware GPU energy, Splitwise-style disaggregation, carbon-aware geo-routing, and burst auto-scaling — to a mature event-driven cloud simulator. The whole extension lives under one Java package (`org.cloudsimplus.llm`); deleting that package leaves CloudSim Plus exactly as shipped.
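To make the workload model concrete, the sketch below mirrors the `LlmCloudlet` request tuple and its phase-aware lifecycle. It is a from-scratch illustration, not the library's API: the `LlmRequest` record and `Phase` enum are hypothetical stand-ins for the real class in `org.cloudsimplus.llm.core`.

```java
// Illustrative stand-in for LlmCloudlet's request model; not the real API.
enum Phase { WAITING, PREFILL, DECODE, DONE }

record LlmRequest(int inputTokens, int outputTokens, String sloClass) {
    // A request moves strictly forward: the prompt is processed once in
    // PREFILL, then outputTokens steps of DECODE, then DONE.
    Phase next(Phase current, int tokensGenerated) {
        return switch (current) {
            case WAITING -> Phase.PREFILL;
            case PREFILL -> Phase.DECODE;
            case DECODE  -> tokensGenerated >= outputTokens ? Phase.DONE : Phase.DECODE;
            case DONE    -> Phase.DONE;
        };
    }
}
```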
Figure 1 — CloudSimLLM class hierarchy. Grey italic boxes are upstream CloudSim Plus non-sealed abstract types (none modified). Yellow boxes are the eleven new CloudSimLLM classes; dashed yellow boxes are new value objects with no upstream parent. Hollow triangles indicate UML generalisation (extends).
| Concern | CloudSim Plus | CloudSimLLM (this work) |
|---|---|---|
| Workload | MIPS-based `Cloudlet` | `LlmCloudlet` with (input, output, sloClass, phase) and a phase-aware lifecycle (WAITING → PREFILL → DECODE → DONE) |
| Hardware | `Pe`, `Host` | `GpuPe` (peak/effective FP16 + HBM, TDP/idle) and `GpuHost` (NVLink + inter-host fabric, region) |
| Scheduling | `CloudletSchedulerAbstract` | `ContinuousBatchScheduler` (drives Algorithm 1: admission, phase choice, paged KV) and `PrefillDecodeDisaggScheduler` (Splitwise role) |
| Brokers | `DatacenterBrokerSimple` | `PrefillDecodeBroker` (decode-shadow handoff), `CarbonAwareBroker` (3 routing policies + `GeoRegion`), `AutoscalingBroker` (warm pool + reactive/predictive policies) |
| Power & resources | `PowerModelHostAbstract`, `ResourceProvisionerAbstract` | `LlmPowerModel` (POLCA-style ρ_dec factor) and `KvCacheProvisioner` (PagedAttention block list) |
| Allocation | `VmAllocationPolicyAbstract` | `LlmVmAllocationPolicy` (HBM-aware, SKU-pinning) |
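The scheduling row above compresses Algorithm 1 into three verbs: admission, phase choice, paged KV. A from-scratch sketch of that loop is shown below; the names (`Req`, `freeBlocks`, `blocksNeeded`) and the fixed block size are hypothetical, and the real `ContinuousBatchScheduler` is event-driven inside CloudSim Plus rather than a step loop.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// From-scratch sketch of continuous batching with paged-KV admission
// (the shape of Algorithm 1 as summarised above); not the library's code.
final class ContinuousBatchSketch {
    static final class Req {
        final int promptTokens, maxNewTokens;
        int generated = 0;
        boolean prefilled = false;
        Req(int promptTokens, int maxNewTokens) {
            this.promptTokens = promptTokens;
            this.maxNewTokens = maxNewTokens;
        }
    }

    static final int BLOCK_TOKENS = 16;   // tokens per KV block (PagedAttention-style)
    int freeBlocks = 4096;                // pool of KV cache blocks on one GPU

    final Queue<Req> waiting = new ArrayDeque<>();
    final List<Req> running = new ArrayList<>();

    int blocksNeeded(Req r) {             // worst-case KV footprint of a request
        return Math.ceilDiv(r.promptTokens + r.maxNewTokens, BLOCK_TOKENS);
    }

    /** One scheduler iteration: admit, choose phase, advance, release. */
    void step() {
        // 1. Admission: pull from the waiting queue while KV blocks last.
        while (!waiting.isEmpty() && blocksNeeded(waiting.peek()) <= freeBlocks) {
            Req r = waiting.poll();
            freeBlocks -= blocksNeeded(r);
            running.add(r);
        }
        // 2. Phase choice: new arrivals do their one prefill pass; everyone
        //    else decodes one token. Both run in the same batch.
        for (Req r : running) {
            if (!r.prefilled) r.prefilled = true;   // prefill pass
            else r.generated++;                     // one decode step
        }
        // 3. Completion: finished requests free their KV blocks.
        running.removeIf(r -> {
            boolean done = r.generated >= r.maxNewTokens;
            if (done) freeBlocks += blocksNeeded(r);
            return done;
        });
    }
}
```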
Twelve calibrated equations capture the LLM workload's resource dynamics in closed form (paper §4); six of their effective parameters are recovered from vLLM-style measurements via `scipy.optimize.curve_fit` (paper §6.1).
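As rough intuition for where the six fitted parameters enter, the sketch below assumes roofline-style forms: prefill compute-bound on F^pre_eff, decode bound by the max of compute (F^dec_eff) and memory traffic (B^eff_mem), and decode power derated by ρ_dec. Both the formulas and the constants are illustrative assumptions, not the paper's §4 equations or the calibrated values.

```java
// Illustrative only: one plausible roofline-style reading of how the six
// fitted parameters could enter latency/power estimates. Placeholder values.
final class RooflineSketch {
    static final double F_PRE_EFF = 250e12;  // effective prefill FLOP/s
    static final double F_DEC_EFF = 120e12;  // effective decode FLOP/s
    static final double B_EFF_MEM = 1.5e12;  // effective HBM bytes/s
    static final double ALPHA_PRE = 0.015;   // prefill launch overhead (s)
    static final double ALPHA_DEC = 0.002;   // per-step decode overhead (s)
    static final double RHO_DEC   = 0.55;    // POLCA-style decode power factor

    // Prefill is compute-bound: time grows with prompt FLOPs.
    static double prefillSeconds(double flopsForPrompt) {
        return ALPHA_PRE + flopsForPrompt / F_PRE_EFF;
    }

    // Decode takes the roofline max; in practice it is memory-bound,
    // since each token re-reads the weights and the KV cache.
    static double decodeSecondsPerToken(double flopsPerToken, double bytesPerToken) {
        return ALPHA_DEC + Math.max(flopsPerToken / F_DEC_EFF, bytesPerToken / B_EFF_MEM);
    }

    // Phase-aware power: decode draws a fraction RHO_DEC of prefill power.
    static double decodeWatts(double prefillWatts) {
        return RHO_DEC * prefillWatts;
    }
}
```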
The four bundled case studies in `tools/experiments/` show what the simulator is for:
- §6.3 Splitwise disaggregation — sweep prefill/decode GPU ratios, prompt distributions, and KV-fabric bandwidths at constant 8-GPU budget. Result: a 5-seed-mean 5.12× P99 TTFT speedup on medium prompts but a 1.42× slowdown on long prompts at fixed budget; KV bandwidth above 100 GB/s is essentially flat.
- §6.4 Heterogeneous GPU mix — sweep all (A100, H100, L40S) compositions across workloads. Result: homogeneous fleets dominate the cost-latency frontier in our setup; mixes are rarely Pareto-optimal at constant device count.
- §6.5 Carbon-aware geo-routing — three regions × three policies × 24 hours × three workloads. Result: pure `CarbonAware` routing cuts emissions by 84–91 % but pays a 2×–4× TTFT penalty on medium and long prompts; the `BLENDED(α)` policy exposes the trade-off.
- §6.6 Bursty workload auto-scaling — `STATIC`/`REACTIVE`/`PREDICTIVE` × low/med/high bursts × 5/15/30 s cold-starts. Result: elastic policies match the static baseline's SLO attainment at ≈49 % lower active VM-hours (read as a cost-equivalence study; absolute SLO attainment is low at the chosen TTFT thresholds).
Total: 990 cells (198 configurations × 5 seeds), ≈108 s on a laptop (8-core Apple M2, JDK 25). An equivalent on-hardware sweep would take ≈50 h of GPU-time per seed — a ≈2600× speedup.
```bash
git clone https://github.com/FUJIANUT/CloudSimLLM.git
cd CloudSimLLM
./mvnw -DskipTests=true package   # JDK 25 required
```

```bash
JAVA_HOME=$(/usr/libexec/java_home -v 25) ./mvnw -q exec:java \
    -Dexec.mainClass=org.cloudsimplus.llm.example.SplitwiseExample
```

```bash
# 1. Run the four parametric sweeps (Java sim) → CSV
python3 tools/experiments/run_splitwise_sweep.py \
    --output tools/experiments/results/splitwise_sweep.csv \
    --seeds "42,43,44,45,46" --jobs 4
python3 tools/experiments/run_heterogeneous_sweep.py \
    --output tools/experiments/results/heterogeneous_sweep.csv \
    --seeds "42,43,44,45,46" --jobs 4
python3 tools/experiments/run_geo_sweep.py \
    --output tools/experiments/results/geo_sweep.csv \
    --seeds "42,43,44,45,46" --jobs 4
python3 tools/experiments/run_autoscale_sweep.py \
    --output tools/experiments/results/autoscale_sweep.csv \
    --seeds "42,43,44,45,46" --jobs 4

# 2. Generate figures (with 5-seed error bars) and summary tables
python3 tools/analysis/case_study_1_cli.py \
    --results tools/experiments/results/splitwise_sweep.csv --outdir tools/analysis/figures/
python3 tools/analysis/case_study_2_cli.py \
    --results tools/experiments/results/heterogeneous_sweep.csv --outdir tools/analysis/figures/
python3 tools/analysis/case_study_3_cli.py \
    --results tools/experiments/results/geo_sweep.csv --outdir tools/analysis/figures/
python3 tools/analysis/case_study_4_cli.py \
    --results tools/experiments/results/autoscale_sweep.csv --outdir tools/analysis/figures/
```

Outputs land in `tools/analysis/figures/`: fig6–fig17 PDF/PNG and `table_case_study_{1..4}.{csv,tex}`. Every numeric metric is reported as mean ± standard deviation across the 5 seeds.
```bash
python3 tools/calibration/run_full_calibration.py \
    --output tools/calibration/_calibration_run/
```

The pipeline (a) consumes vLLM-style benchmark measurements, (b) samples GPU power, and (c) fits the six effective parameters F^pre_eff, F^dec_eff, B^eff_mem, α_pre, α_dec, ρ_dec via constrained `scipy.optimize.curve_fit`. The current release is calibrated against published vLLM, POLCA, and Splitwise measurements (literature-derived ground truth); the same interface accepts direct on-hardware vLLM measurements as a drop-in replacement.
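To see the fitting step without the Python stack, the sketch below is a conceptual stand-in in plain Java: it fits the two-parameter prefill model T = α_pre + W / F^pre_eff by ordinary least squares on synthetic points. The real pipeline fits all six parameters jointly with constrained `scipy.optimize.curve_fit`; the numbers here are illustrative, not calibrated values.

```java
// Conceptual stand-in for the curve-fit step: fit T = a + b*W to
// (FLOPs, latency) pairs, then read off alpha_pre = a, F_pre_eff = 1/b.
final class PrefillFitSketch {
    public static void main(String[] args) {
        double[] w = {1e12, 2e12, 4e12, 8e12};        // prompt FLOPs (illustrative)
        double[] t = {0.019, 0.023, 0.031, 0.047};    // measured latency in s (illustrative)

        int n = w.length;
        double sw = 0, st = 0, sww = 0, swt = 0;
        for (int i = 0; i < n; i++) {
            sw += w[i]; st += t[i]; sww += w[i] * w[i]; swt += w[i] * t[i];
        }
        double b = (n * swt - sw * st) / (n * sww - sw * sw);  // slope = 1/F_pre_eff
        double a = (st - b * sw) / n;                          // intercept = alpha_pre

        System.out.printf("alpha_pre ≈ %.4f s, F_pre_eff ≈ %.3e FLOP/s%n", a, 1.0 / b);
    }
}
```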
```text
src/main/java/org/cloudsimplus/llm/
├── core/        GpuPe, GpuHost, LlmCloudlet
├── workload/    LlmModelSpec, KvCacheProvisioner, KvCacheBlock
├── scheduler/   ContinuousBatchScheduler, PrefillDecodeBroker,
│                PrefillDecodeDisaggScheduler, LlmVmAllocationPolicy
├── geo/         CarbonAwareBroker, GeoRegion
├── autoscale/   AutoscalingBroker, WarmPoolAutoscaler
├── power/       LlmPowerModel
├── trace/       AzureLlmTraceReader, BurstGptTraceReader
├── metrics/     LlmStatistics
└── example/     SplitwiseExample, SplitwiseSweepRunner,
                 HeterogeneousMixRunner, GeoDistributedRunner,
                 AutoscalingRunner, LlmExample

tools/
├── calibration/ vLLM benchmark, GPU power sampler, curve-fit pipeline
├── experiments/ 4 sweep harnesses + 198-configuration result CSVs (5 seeds)
└── analysis/    Figure-generation scripts and notebooks
```
The rest of the tree is upstream CloudSim Plus (unmodified). The original upstream README is preserved at `README.cloudsimplus-upstream.md`.
- ✅ Java extension + Python toolchain released (this branch).
- ✅ 5-seed sweeps (990 cells) reproducible end-to-end on a laptop.
- ✅ Calibration toolchain works; current release uses literature-derived ground truth.
- ⏳ Direct on-hardware vLLM calibration (A100 / H100 / L40S) is the next validation step.
- ⏳ Comparison against production autoscalers (Kubernetes HPA, KEDA, SageServe) is future work.
If you use CloudSimLLM in your research, please cite:
```bibtex
@article{cloudsimllm2026,
  title   = {CloudSimLLM: A Datacenter-Scale Simulation Framework for
             Energy- and SLO-Aware LLM Inference Serving},
  author  = {Jiang, Chunmao and Ye, Ruyi and Zhang, Hao},
  journal = {Future Generation Computer Systems},
  year    = {2026},
  note    = {Under review}
}
```

Please also cite the upstream CloudSim Plus paper:

```bibtex
@inproceedings{silva2017cloudsimplus,
  title     = {CloudSim Plus: A Cloud Computing Simulation Framework Pursuing
               Software Engineering Principles for Improved Modularity,
               Extensibility and Correctness},
  author    = {Silva Filho, M. C. and Oliveira, R. L. and Monteiro, C. C. and
               In{\'a}cio, P. R. M. and Freire, M. M.},
  booktitle = {IFIP/IEEE International Symposium on Integrated Network Management},
  year      = {2017}
}
```

CloudSimLLM is released under the GPL-3.0 license, matching upstream CloudSim Plus. See LICENSE.
We thank the CloudSim Plus maintainers for the extensible event-driven simulation foundation, and the vLLM community for the reference implementations against which CloudSimLLM is calibrated.