Benchmark

Running Performance Tests

Generate benchmark output:

java --add-modules jdk.incubator.vector -jar build/libs/gpt-oss-java-1.0.0-all.jar \
/path/to/gpt-oss-20b/original/model.safetensors \
--debug < /path/to/gpt-oss.java/benchmark/input_prompts.txt > /path/to/output_result.txt

Analyze results:

python3 /path/to/gpt-oss.java/benchmark/analyze_results.py --file /path/to/output_result.txt --group
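The Gen Tok/s and Prefill Tok/s columns in the results below are throughput rates: tokens processed in a phase divided by that phase's wall-clock time. A minimal sketch of the computation, assuming the --debug output records per-phase token counts and timings (the record and field names here are hypothetical, not the analyzer's actual schema):

// Hypothetical per-prompt measurement; names are illustrative, not the real schema.
record PromptRun(int promptTokens, int generatedTokens,
                 long prefillNanos, long decodeNanos) {

    // Prefill Tok/s: prompt tokens processed per second of prefill time.
    double prefillTokPerSec() {
        return promptTokens / (prefillNanos / 1e9);
    }

    // Gen Tok/s: generated tokens per second of decode time.
    double genTokPerSec() {
        return generatedTokens / (decodeNanos / 1e9);
    }
}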

Results

All performance benchmarks are based on gpt-oss-20b.

macOS - Apple M3 Pro

Hardware:

  • Host Type: MacBook Pro (Mac15,7)
  • CPU: Apple M3 Pro
  • CPU Cores: 12 cores total (6 performance + 6 efficiency)
  • Memory (RAM): 36 GB
  • Operating System: macOS 14.6.1 (Sonoma), Build 23G93

Java Runtime:

java version "24" 2025-03-18
Java(TM) SE Runtime Environment (build 24+36-3646)
Java HotSpot(TM) 64-Bit Server VM (build 24+36-3646, mixed mode, sharing)

Performance:

==========================================================================================
BENCHMARK RESULTS
==========================================================================================
Prompt                    Prompt Len Max Tokens Generated  Gen Tok/s  Prefill Tok/s
------------------------------------------------------------------------------------------
What happens if you pu... 10         100        100        9.125      12.084      
Explain the difference... 11         100        100        9.174      12.061      
Write a Python functio... 10         100        100        9.133      12.122      
Why do people use umbr... 9          200        73         9.166      12.094      
What are the main diff... 12         500        487        8.835      12.105      
Write a short story ab... 7          500        500        8.790      12.051      
Suggest three novel wa... 14         500        500        8.729      12.105      
Hi team, I wanted quic... 999        1000       895        6.643      9.593       
------------------------------------------------------------------------------------------
AVERAGE                                                    8.699      11.777    

Note that the benchmark was run on a freshly started Mac with ample free memory, which provides excellent mmap performance for the MLP weight files. Performance may degrade under high memory pressure or when other applications are competing for memory. You may need to specify -Xmx16G or a larger heap in the JVM options.
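For example, rerunning the benchmark command from above with a 16 GB heap (paths are placeholders as before):

java -Xmx16G --add-modules jdk.incubator.vector -jar build/libs/gpt-oss-java-1.0.0-all.jar \
/path/to/gpt-oss-20b/original/model.safetensors \
--debug < /path/to/gpt-oss.java/benchmark/input_prompts.txt > /path/to/output_result.txt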

Linux - Intel Xeon Platinum 8175M

Hardware:

  • Host Type: AWS EC2 m5.4xlarge
  • CPU: Intel Xeon Platinum 8175M @ 2.5 GHz (Skylake architecture)
  • CPU Cores: 8 physical cores, 16 vCPUs
  • Memory (RAM): 64 GB
  • Operating System: Amazon Linux 2023

Java Runtime:

openjdk version "24.0.2" 2025-07-15
OpenJDK Runtime Environment (build 24.0.2+12-54)
OpenJDK 64-Bit Server VM (build 24.0.2+12-54, mixed mode, sharing)

Performance:

==========================================================================================
BENCHMARK RESULTS
==========================================================================================
Prompt                    Prompt Len Max Tokens Generated  Gen Tok/s  Prefill Tok/s
------------------------------------------------------------------------------------------
What happens if you pu... 10         100        100        7.307      10.311      
Explain the difference... 11         100        100        7.338      10.382      
Write a Python functio... 10         100        100        7.334      10.324      
Why do people use umbr... 9          200        200        7.219      10.233      
What are the main diff... 12         500        500        6.873      10.373      
Write a short story ab... 7          500        500        6.880      10.126      
Suggest three novel wa... 14         500        500        6.871      10.473      
Hi team, I wanted quic... 999        1000       1000       4.405      7.307       
------------------------------------------------------------------------------------------
AVERAGE                                                    6.778      9.941 

All tests were run with 16 threads, with peak CPU usage of 93%.
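If you need to pin the worker-thread count explicitly, one common JVM knob is the common ForkJoinPool's parallelism; whether this benchmark's executor honors it is an assumption, so verify against the source before relying on it:

java -Djava.util.concurrent.ForkJoinPool.common.parallelism=16 -Xmx16G \
--add-modules jdk.incubator.vector -jar build/libs/gpt-oss-java-1.0.0-all.jar \
/path/to/gpt-oss-20b/original/model.safetensors \
--debug < /path/to/gpt-oss.java/benchmark/input_prompts.txt > /path/to/output_result.txt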

On the same machine, with MXFP4 model file:

  • PyTorch gpt-oss achieves ~0.04 tokens/s in the decode phase at a peak of 75% CPU usage.

  • Hugging Face Transformers achieves ~3.4 tokens/s in the decode phase and ~14 tokens/s in the prefill phase at almost 100% CPU usage.

  • llama.cpp achieves 16.62 tokens/s in the decode phase and 32.00 tokens/s in the prefill phase when running MXFP4-quantized GGUF V3 models at almost 100% CPU usage.

    Performance:

    build: 6708 (df1b612e) with cc (GCC) 14.2.1 20250110 (Red Hat 14.2.1-7) for x86_64-amazon-linux
    
    llama_perf_sampler_print:    sampling time =     270.28 ms /   971 runs   (    0.28 ms per token,  3592.61 tokens per second)
    llama_perf_context_print:        load time =   46336.50 ms
    llama_perf_context_print: prompt eval time =     343.74 ms /    11 tokens (   31.25 ms per token,    32.00 tokens per second)
    llama_perf_context_print:        eval time =   57688.11 ms /   959 runs   (   60.15 ms per token,    16.62 tokens per second)
    llama_perf_context_print:       total time =   64980.31 ms /   970 tokens
    llama_perf_context_print:    graphs reused =        955
    llama_memory_breakdown_print: | memory breakdown [MiB] | total   free     self   model   context   compute    unaccounted |
    llama_memory_breakdown_print: |   - Host               |                 12048 = 11536 +     114 +     398                |