Generate benchmark output:
java --add-modules jdk.incubator.vector -jar build/libs/gpt-oss-java-1.0.0-all.jar \
/path/to/gpt-oss-20b/original/model.safetensors \
--debug < /path/to/gpt-oss.java/benchmark/input_prompts.txt > /path/to/output_result.txt
Analyze results:
python3 /path/to/gpt-oss.java/benchmark/analyze_results.py --file /path/to/output_result.txt --group
All performance benchmarks are based on gpt-oss-20b.
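The Gen Tok/s and Prefill Tok/s columns in the tables below follow the usual definition of tokens divided by the wall-clock time of the corresponding phase. A minimal Java sketch of that computation (illustrative only; the class and method names are hypothetical, not code from this repository):

final class ThroughputSketch {
    // Prefill Tok/s: prompt tokens processed per second before the first generated token.
    static double prefillTokPerSec(int promptTokens, long prefillNanos) {
        return promptTokens / (prefillNanos / 1e9);
    }

    // Gen Tok/s: tokens emitted per second during the decode phase.
    static double genTokPerSec(int generatedTokens, long decodeNanos) {
        return generatedTokens / (decodeNanos / 1e9);
    }

    public static void main(String[] args) {
        // Matches the first M3 Pro row below: 100 tokens over ~10.959 s of decode ≈ 9.125 tok/s.
        System.out.printf("%.3f tok/s%n", genTokPerSec(100, 10_959_000_000L));
    }
}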
Hardware:
- Host Type: MacBook Pro (Mac15,7)
- CPU: Apple M3 Pro
- CPU Cores: 12 cores total (6 performance + 6 efficiency)
- Memory (RAM): 36 GB
- Operating System: macOS 14.6.1 (Sonoma), Build 23G93
Java Runtime:
java version "24" 2025-03-18
Java(TM) SE Runtime Environment (build 24+36-3646)
Java HotSpot(TM) 64-Bit Server VM (build 24+36-3646, mixed mode, sharing)
Performance:
==========================================================================================
BENCHMARK RESULTS
==========================================================================================
Prompt Prompt Len Max Tokens Generated Gen Tok/s Prefill Tok/s
------------------------------------------------------------------------------------------
What happens if you pu... 10 100 100 9.125 12.084
Explain the difference... 11 100 100 9.174 12.061
Write a Python functio... 10 100 100 9.133 12.122
Why do people use umbr... 9 200 73 9.166 12.094
What are the main diff... 12 500 487 8.835 12.105
Write a short story ab... 7 500 500 8.790 12.051
Suggest three novel wa... 14 500 500 8.729 12.105
Hi team, I wanted quic... 999 1000 895 6.643 9.593
------------------------------------------------------------------------------------------
AVERAGE 8.699 11.777
Note that the benchmark program was run on a freshly started Mac with plenty of free memory, which provides excellent mmap performance for the MLP weight
files. Performance may degrade under high memory pressure or when other applications are competing for memory. You may need to specify -Xmx16G or a larger heap in the JVM options.
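For example, the generation command from above with an enlarged heap (same placeholder paths):
java -Xmx16G --add-modules jdk.incubator.vector -jar build/libs/gpt-oss-java-1.0.0-all.jar \
/path/to/gpt-oss-20b/original/model.safetensors \
--debug < /path/to/gpt-oss.java/benchmark/input_prompts.txt > /path/to/output_result.txt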
Hardware:
- Host Type: AWS EC2 m5.4xlarge
- CPU: Intel Xeon Platinum 8175M @ 2.5 GHz (Skylake architecture)
- CPU Cores: 8 physical cores, 16 vCPUs
- Memory (RAM): 64 GB
- Operating System: Amazon Linux 2023
Java Runtime:
openjdk version "24.0.2" 2025-07-15
OpenJDK Runtime Environment (build 24.0.2+12-54)
OpenJDK 64-Bit Server VM (build 24.0.2+12-54, mixed mode, sharing)
Performance:
==========================================================================================
BENCHMARK RESULTS
==========================================================================================
Prompt Prompt Len Max Tokens Generated Gen Tok/s Prefill Tok/s
------------------------------------------------------------------------------------------
What happens if you pu... 10 100 100 7.307 10.311
Explain the difference... 11 100 100 7.338 10.382
Write a Python functio... 10 100 100 7.334 10.324
Why do people use umbr... 9 200 200 7.219 10.233
What are the main diff... 12 500 500 6.873 10.373
Write a short story ab... 7 500 500 6.880 10.126
Suggest three novel wa... 14 500 500 6.871 10.473
Hi team, I wanted quic... 999 1000 1000 4.405 7.307
------------------------------------------------------------------------------------------
AVERAGE 6.778 9.941
All tests were run with 16 threads, with peak CPU usage of 93%.
On the same machine, with the MXFP4 model file:
- PyTorch gpt-oss achieves ~0.04 tokens/s in the decode phase, with peak CPU usage of 75%.
- Hugging Face Transformers achieves ~3.4 tokens/s in the decode phase and ~14 tokens/s in the prefill phase, using almost 100% of the CPU.
- llama.cpp achieves 16.62 tokens/s in the decode phase and 32.00 tokens/s in the prefill phase when running MXFP4-quantized GGUF V3 models, using almost 100% of the CPU.
llama.cpp performance output:
build: 6708 (df1b612e) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-amazon-linux
llama_perf_sampler_print:    sampling time =     270.28 ms /   971 runs   (    0.28 ms per token,  3592.61 tokens per second)
llama_perf_context_print:        load time =   46336.50 ms
llama_perf_context_print: prompt eval time =     343.74 ms /    11 tokens (   31.25 ms per token,    32.00 tokens per second)
llama_perf_context_print:        eval time =   57688.11 ms /   959 runs   (   60.15 ms per token,    16.62 tokens per second)
llama_perf_context_print:       total time =   64980.31 ms /   970 tokens
llama_perf_context_print:    graphs reused =        955
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free   self   model   context   compute   unaccounted |
llama_memory_breakdown_print: |   - Host               | 12048 = 11536 + 114 + 398 |