Pass Rate Comparison (pass@1, temp 0.0)
Score vs. Cost
Hard Task Performance (49 hardest tasks): tasks solved by ≤2 providers
GrayGate Advantage (vs. competitors): tasks GrayGate passed that others failed
Provider Kill Count (5 competitors): tasks GrayGate passes that stumped N others
| # | Model | Provider | Pass Rate | Passed | Runs | Tokens | Cost | $/Pass |
|---|---|---|---|---|---|---|---|---|
| 1 | GrayGate Pipeline + Gemini 3 Flash | GrayGate | 91.6% (range 90.7–92.7%) | 138/151 | 3 | — | Free | — |
| 2 | gemini-3-pro-preview | | 59.9% (range 59.6–60.3%) | 90/151 | 2 | 56k | $0.52 | $0.006 |
| 3 | gemini-3-flash-preview | | 55.6% | 84/151 | 3 | 90k | $0.24 | $0.003 |
| 4 | gpt-5.2 | OpenAI | 46.1% (range 42.4–49.0%) | 70/151 | 3 | 57k | $0.64 | $0.009 |
| 5 | qwen/qwen3.5-397b-a17b | Qwen | 38.4% (range 37.8–39.1%) | 58/151 | 3 | 149k | $0.14 | $0.002 |
| 6 | deepseek-reasoner | DeepSeek | 32.5% (range 31.1–33.8%) | 49/151 | 3 | 410k | $0.17 | $0.003 |
| 7 | hf.co/Qiskit/qwen2.5-coder-14b-qiskit-GGUF:latest | Ollama | 25.8% | 39/151 | 2 | 65k | Free | — |
How Scores Are Computed
Prompt Extraction
Each of the 151 tasks contains a function signature, docstring, and a hidden check() test harness.
Code Generation
The model receives the prompt and must return a complete function body. Temperature is fixed at 0.0 to keep output as close to deterministic as possible; pass rates can still vary slightly between runs.
Sandboxed Execution
Generated code runs in an isolated venv with Qiskit 2.0.0 and pinned dependencies. 60 s timeout per task.
Test Validation
The hidden test harness runs assertions against the generated function. A task is PASS only if every assertion succeeds with zero errors.
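The execution and validation steps can be sketched roughly as follows. This is a minimal illustration, not GrayBench's actual implementation: the `run_task` helper, its parameters, and the HumanEval-style `check(<function>)` call convention are assumptions. In a real run, `python` would point at the interpreter of the isolated venv with the pinned Qiskit install.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

TIMEOUT_S = 60  # per-task limit described in the text


def run_task(candidate_code: str, check_code: str, entry_point: str,
             python: str = sys.executable, timeout: float = TIMEOUT_S) -> bool:
    """Hypothetical helper: True only if every assertion in check() succeeds.

    Assumes the hidden harness defines check(fn) and is called with the
    generated function, as in HumanEval-style benchmarks.
    """
    program = f"{candidate_code}\n\n{check_code}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(program, encoding="utf-8")
        try:
            # A separate process isolates crashes and lets us enforce the timeout.
            proc = subprocess.run(
                [python, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # a timeout counts as a failure
    # Any failed assertion or uncaught exception yields a non-zero exit code.
    return proc.returncode == 0
```

Because the verdict is just the process exit status, there is no partial credit: one failed assertion, exception, or timeout marks the whole task as failed.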
Scoring
Final score = passed tasks / 151. Token counts and cost are tracked per-task for efficiency analysis.
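As a worked example of the scoring arithmetic, the sketch below computes a single run's pass rate and the cost per passed task. The function names are illustrative, and the $/Pass formula (cost divided by passed tasks) is inferred from the leaderboard figures rather than stated by the source. Where a row reports a range, the headline pass rate may not equal passed/151 for the single Passed count shown.

```python
from typing import Optional

TOTAL_TASKS = 151  # benchmark size stated in the text


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Pass rate of one run, as a percentage."""
    return 100.0 * passed / total


def cost_per_pass(cost_usd: Optional[float], passed: int) -> Optional[float]:
    """Dollars spent per passed task; None for free runs or zero passes."""
    if cost_usd is None or passed == 0:
        return None
    return cost_usd / passed


# gemini-3-flash-preview row: 84/151 passed, $0.24 total cost
print(round(pass_rate(84), 1))            # 55.6
print(round(cost_per_pass(0.24, 84), 3))  # 0.003
```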
Verified Execution
Every submission runs in a real Python process. No regex matching, no partial credit — the code either passes all tests or it doesn't.
Open Source
The evaluation framework, datasets, and scoring logic are fully open. Anyone can reproduce results or add new models.
Pinned Dependencies
Qiskit 2.0.0, numpy, scipy — every run uses the exact same package versions for fair comparison across models and dates.
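One way to confirm that an environment matches the pins before a run is to compare installed versions against the expected ones. The helper below is a hypothetical sketch using only the standard library; it is not part of GrayBench, and only the Qiskit version is taken from the text (the numpy and scipy pins are not given here).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Return {package: installed_version} for every package that differs
    from its pin; a missing package maps to None."""
    mismatches: Dict[str, Optional[str]] = {}
    for name, wanted in pins.items():
        try:
            installed: Optional[str] = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


if __name__ == "__main__":
    # Only the Qiskit pin is stated in the text; add others as needed.
    print(find_mismatches({"qiskit": "2.0.0"}))
```

An empty result means every listed package is installed at exactly the pinned version.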
Complex quantum algorithms, transpilation, error correction. The primary leaderboard benchmark.
Standard quantum computing tasks with Qiskit 2.0. Easier baseline for model capability assessment.
Research-level physics code generation. 190 tasks covering critical point calculations.
Run Your Own Benchmarks
Install GrayBench, point it at any provider, and get verified results in minutes. Open source, reproducible, and extensible.
$ git clone https://github.com/GrayArea-Labs/GrayBench.git
$ cd GrayBench && pip install -e .
$ graybench env setup qiskitbench
$ graybench keys set google
$ graybench run qiskitbench-hard -m google/gemini-2.5-flash